Search for text contained in any row of a pandas DataFrame

Question

I have the following DataFrame:

pred[['right_context', 'PERC']]
Out[247]: 
                          right_context      PERC
0  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.000197
1                San Pedro xxxxxxxxxxxx  0.572630
2          zxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.572630
3             de San Pedro Este parcela  0.572630
4   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.035577

And I have another pandas DataFrame called _direcciones with real addresses:

388427          SAN PEDRO              1
388428     bbbbbbbbbbbbbbbbbbbbbb      1
388429        yyyyyyyyyyyyyyyyyyy      1
[388430 rows x 2 columns]

I need to check whether any address in _direcciones is contained in the first DataFrame. What I have done is:

[True for y in pred.right_context 
   for x in _direcciones.entity_content 
   if re.match(r'^%s\b' %x, y, flags=re.I)]

But it is very slow, and, more importantly, I would like to append to the first DataFrame a column with values True|False if an address was found, but currently I can't because the above code can return any number of rows, not exactly 5, like I would need for the first DataFrame.
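The comprehension above emits one True per (row, address) match, which is why its length has nothing to do with the number of rows in pred. A minimal sketch of a per-row version follows. The two small frames are made-up stand-ins for the real data, the helper name starts_with_address is hypothetical, and re.escape is an addition so that addresses containing regex metacharacters are matched literally:

```python
import re
import pandas as pd

# Hypothetical stand-ins for the real frames (sample values are assumptions).
pred = pd.DataFrame({
    "right_context": [
        "xxxx",
        "San Pedro xxxx",
        "zxxxx",
        "de San Pedro Este parcela",
        "xxx",
    ],
})
_direcciones = pd.DataFrame({"entity_content": ["SAN PEDRO", "bbbb", "yyyy"]})


def starts_with_address(text, addresses):
    """Return True if `text` starts with any of `addresses` (case-insensitive)."""
    return any(
        re.match(r"^%s\b" % re.escape(addr), text, flags=re.I)
        for addr in addresses
    )


addresses = _direcciones["entity_content"].tolist()

# Exactly one boolean per row of pred, so it can be assigned as a column.
pred["found?"] = [starts_with_address(y, addresses) for y in pred["right_context"]]
print(pred)
```

Because re.match with ^ only matches at the beginning of the string, "de San Pedro Este parcela" comes out False here; getting True for that row, as in the desired output, would need a search anywhere in the string. This version is also still a Python-level double loop, so it does not address the speed problem.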

Something like this:

pred[['right_context', 'PERC']]
Out[247]: 
                          right_context      PERC    found?
0  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.000197       F
1                San Pedro xxxxxxxxxxxx  0.572630       T
2          zxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.572630       F
3             de San Pedro Este parcela  0.572630       T
4   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.035577       F



Update

Thanks for the answers, but I am still facing the same issue: _direcciones is so large that the chance of some word from it appearing in pred.right_context is very high. For example:

0    URBANA. OBRA NUEVA TERMINADA. Urbana
1                  San Pedro número xxxxx

Here, I am looking for San Pedro, but both San Pedro and URBANA are in _direcciones, so both rows will be True. I do not know how to approach the problem.

Recommended answer

Series.str.contains & str.upper

You can use Series.str.contains and join the column in _direcciones into one string with | as the separator, so that it acts as a regex alternation.

It is also important to note that the strings in pred have to be converted to uppercase with str.upper, because the addresses in _direcciones are uppercase.

pred['found?'] = pred['right_context'].str.upper()\
                                      .str.contains('|'.join(_direcciones['Address']))







print(pred)
                          right_context      PERC  found?
0  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.000197   False
1                San Pedro xxxxxxxxxxxx  0.572630    True
2          zxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.572630   False
3             de San Pedro Este parcela  0.572630    True
4   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.035577   False
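For reference, here is a self-contained sketch of the same idea that can be run as-is. The frames are rebuilt by hand from the sample shown in the question (the column name Address follows this answer, not the question's code). Two things differ from the snippet above and are assumptions about what is wanted: each address is passed through re.escape so that characters such as . or ( are matched literally, and case=False is used instead of upper-casing, so the addresses themselves do not need to be uppercase:

```python
import re
import pandas as pd

# Sample data rebuilt by hand from the question (not the real frames).
pred = pd.DataFrame({
    "right_context": [
        "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "San Pedro xxxxxxxxxxxx",
        "zxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "de San Pedro Este parcela",
        "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    ],
    "PERC": [0.000197, 0.572630, 0.572630, 0.572630, 0.035577],
})
_direcciones = pd.DataFrame({
    "Address": ["SAN PEDRO", "bbbbbbbbbbbbbbbbbbbbbb", "yyyyyyyyyyyyyyyyyyy"],
})

# Escape every address, then join them into a single alternation pattern.
pattern = "|".join(re.escape(addr) for addr in _direcciones["Address"])

# case=False makes the match case-insensitive on both sides.
pred["found?"] = pred["right_context"].str.contains(pattern, case=False, regex=True)
print(pred)
```

With several hundred thousand addresses the joined pattern becomes very large, so it may be worth dropping duplicate addresses before joining.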



To get only T & F:

pred['found?'] = pred['right_context'].str.upper()\
                                      .str.contains('|'.join(_direcciones['Address']))\
                                      .astype(str).str[:1]







print(pred)
                          right_context      PERC found?
0  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.000197      F
1                San Pedro xxxxxxxxxxxx  0.572630      T
2          zxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.572630      F
3             de San Pedro Este parcela  0.572630      T
4   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.035577      F



Output of '|'.join

'|'.join(_direcciones['Address'])

'SAN PEDRO|bbbbbbbbbbbbbbbbbbbbbb|yyyyyyyyyyyyyyyyyyy'
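Regarding the update in the question: a plain alternation cannot tell a wanted address from a generic word such as URBANA, because both are entries in _direcciones. The sketch below does not solve that, but it may help diagnose it by recording which entry matched in each row, so that overly generic entries can be identified and removed from the list. The sample frames, the extra address "AV. MAYOR" and the column name matched are assumptions for illustration. The pattern is wrapped in \b word boundaries and the alternatives are sorted longest first so that longer addresses are tried before shorter ones:

```python
import re
import pandas as pd

# Hypothetical sample data based on the example in the question's update.
pred = pd.DataFrame({
    "right_context": [
        "URBANA. OBRA NUEVA TERMINADA. Urbana",
        "San Pedro número xxxxx",
        "xxxxxxxx",
    ],
})
_direcciones = pd.DataFrame({"Address": ["SAN PEDRO", "URBANA", "AV. MAYOR"]})

# Escape each address; put longer alternatives first so they are tried first.
escaped = sorted(
    (re.escape(addr) for addr in _direcciones["Address"]),
    key=len,
    reverse=True,
)

# One capture group around the alternation, bounded by word boundaries.
pattern = r"\b(" + "|".join(escaped) + r")\b"

# str.extract returns the first match per row (NaN when nothing matches).
pred["matched"] = pred["right_context"].str.extract(
    pattern, flags=re.IGNORECASE, expand=False
)
pred["found?"] = pred["matched"].notna()
print(pred)
```

Counting the values in the matched column (for example with value_counts) shows which entries of _direcciones fire most often, which is one way to spot words that are too generic to be useful as addresses.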
