Search for text contained in any row of a pandas DataFrame

Views: 112 · Published: 2020/10/17 2:48:31 · Tags: python, pandas, dataframe


Problem description

I have the following DataFrame:

pred[['right_context', 'PERC']]
Out[247]: 
                          right_context      PERC
0  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.000197
1                San Pedro xxxxxxxxxxxx  0.572630
2          zxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.572630
3             de San Pedro Este parcela  0.572630
4   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.035577

I also have another pandas DataFrame called _direcciones, which holds real addresses:

388427          SAN PEDRO              1
388428     bbbbbbbbbbbbbbbbbbbbbb      1
388429        yyyyyyyyyyyyyyyyyyy      1
[388430 rows x 2 columns]

I need to check whether any address in _direcciones is contained in the first DataFrame. What I have done is:

[True for y in pred.right_context 
   for x in _direcciones.entity_content 
   if re.match(r'^%s\b' %x, y, flags=re.I)]

But this is very slow and, more importantly, I would like to add a column to the first DataFrame with True|False values indicating whether an address was found. Currently I can't, because the code above can return any number of elements, not exactly 5 as I would need for the first DataFrame.

Something like this:

pred[['right_context', 'PERC']]
Out[247]: 
                          right_context      PERC    found?
0  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.000197       F
1                San Pedro xxxxxxxxxxxx  0.572630       T
2          zxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.572630       F
3             de San Pedro Este parcela  0.572630       T
4   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.035577       F

Update

Thanks for the answers, but I am still facing the same issue: _direcciones is so large that the chance of pred.right_context containing some word from _direcciones is very high. For example:

0    URBANA. OBRA NUEVA TERMINADA. Urbana
1                  San Pedro número xxxxx

Here I am looking for San Pedro, but both San Pedro and URBANA are in _direcciones, so both rows will be True. I do not know how to approach this problem.

Recommended answer

`Series.str.contains` & `str.upper`

You can use Series.str.contains and join the address column of _direcciones into one string with | as the separator. (The code below calls that column Address; in the question it is entity_content.)

It is also important to note that we have to convert the strings in pred to uppercase with str.upper, so that a value such as San Pedro matches SAN PEDRO.

pred['found?'] = pred['right_context'].str.upper()\
                                      .str.contains('|'.join(_direcciones['Address']))

print(pred)
                          right_context      PERC  found?
0  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.000197   False
1                San Pedro xxxxxxxxxxxx  0.572630    True
2          zxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.572630   False
3             de San Pedro Este parcela  0.572630    True
4   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.035577   False
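
As a rough sketch of a more defensive variant of the call above, the pattern can be escaped and the match made case-insensitive, so that neither side has to be uppercased. The sample frames below are shortened stand-ins for the ones in the question, and the column name Address is assumed from the answer's code.

```python
import re
import pandas as pd

# Shortened stand-ins for the frames in the question (assumed data).
pred = pd.DataFrame({
    "right_context": [
        "xxxxxxxxxxxx",
        "San Pedro xxxxxxxxxxxx",
        "zxxxxxxxxxxx",
        "de San Pedro Este parcela",
        "xxxxxxxxxxx",
    ],
    "PERC": [0.000197, 0.572630, 0.572630, 0.572630, 0.035577],
})
_direcciones = pd.DataFrame({
    "Address": ["SAN PEDRO", "bbbbbbbbbbbbbbbbbbbbbb", "yyyyyyyyyyyyyyyyyyy"],
})

# Drop missing and duplicate addresses, escape each one so it is matched
# literally, and join them into a single alternation pattern.
addresses = _direcciones["Address"].dropna().unique()
pattern = "|".join(re.escape(a) for a in addresses)

# case=False ignores case on both sides; na=False yields False (not NaN)
# for rows where right_context is missing.
pred["found?"] = pred["right_context"].str.contains(
    pattern, case=False, regex=True, na=False
)
print(pred)
```

With case=False, lowercase entries in _direcciones can also match, which the str.upper approach would miss.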

Only get `T` & `F`

pred['found?'] = pred['right_context'].str.upper()\
                                      .str.contains('|'.join(_direcciones['Address']))\
                                      .astype(str).str[:1]

print(pred)
                          right_context      PERC found?
0  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.000197      F
1                San Pedro xxxxxxxxxxxx  0.572630      T
2          zxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.572630      F
3             de San Pedro Este parcela  0.572630      T
4   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.035577      F
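
An alternative to slicing the string form of the booleans is to map them to labels explicitly. This is only a sketch: the Series named found stands in for the boolean result of str.contains and is not part of the original answer.

```python
import pandas as pd

# Stand-in for the boolean column produced by str.contains (assumed values).
found = pd.Series([False, True, False, True, False])

# Map each boolean to its one-letter label explicitly.
labels = found.map({True: "T", False: "F"})
print(labels.tolist())
```

The mapping states the allowed labels directly instead of relying on the first character of "True" and "False".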

Output of `'|'.join`

'|'.join(_direcciones['Address'])

'SAN PEDRO|bbbbbbbbbbbbbbbbbbbbbb|yyyyyyyyyyyyyyyyyyy'
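
Because str.contains treats this joined string as a regular expression by default, each | acts as an "or" between addresses, and any other regex metacharacter inside an address is interpreted as well. The sketch below uses the standard re module and a hypothetical address, AV. SUR, to show the difference that escaping makes.

```python
import re

# "AV. SUR" is a hypothetical address containing a regex metacharacter.
addresses = ["SAN PEDRO", "AV. SUR"]

raw = "|".join(addresses)                            # "." means "any character"
escaped = "|".join(re.escape(a) for a in addresses)  # "." is matched literally

# The unescaped pattern wrongly accepts "AVX SUR"; the escaped one does not.
raw_hit = re.search(raw, "AVX SUR") is not None
escaped_hit = re.search(escaped, "AVX SUR") is not None
literal_hit = re.search(escaped, "AV. SUR") is not None
print(raw_hit, escaped_hit, literal_hit)
```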
