Search for text contained in any row of a pandas DataFrame
Problem description
I have the following DataFrame:
pred[['right_context', 'PERC']]
Out[247]:
right_context PERC
0 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.000197
1 San Pedro xxxxxxxxxxxx 0.572630
2 zxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.572630
3 de San Pedro Este parcela 0.572630
4 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.035577
And I have another pandas DataFrame, called _direcciones, with real addresses:
388427 SAN PEDRO 1
388428 bbbbbbbbbbbbbbbbbbbbbb 1
388429 yyyyyyyyyyyyyyyyyyy 1
[388430 rows x 2 columns]
I need to check whether any address in _direcciones is contained in the first DataFrame. What I have done is:
[True for y in pred.right_context
for x in _direcciones.entity_content
if re.match(r'^%s\b' %x, y, flags=re.I)]
But it is very slow and, more importantly, I would like to append to the first DataFrame a column with values True|False indicating whether an address was found. Currently I can't, because the code above can return any number of rows rather than exactly 5, which is what I would need for the first DataFrame.
Something like this:
pred[['right_context', 'PERC']]
Out[247]:
right_context PERC found?
0 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.000197 F
1 San Pedro xxxxxxxxxxxx 0.572630 T
2 zxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.572630 F
3 de San Pedro Este parcela 0.572630 T
4 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.035577 F
Update
Thanks for the answers, but I am still facing the same issue: _direcciones is so large that the chances that pred.right_context contains some word from _direcciones are very high. For example:
0 URBANA. OBRA NUEVA TERMINADA. Urbana
1 San Pedro número xxxxx
Here I am looking for San Pedro, but both San Pedro and URBANA are in _direcciones, so both rows will be True. I do not know how to approach the problem.
Recommended answer
Series.str.contains & str.upper
You can use Series.str.contains and join the column in _direcciones into one string with | as the separator.
It is also important to note that we have to convert the strings in the pred DataFrame to uppercase with str.upper, since the sample addresses in _direcciones (e.g. SAN PEDRO) are uppercase and the match is case-sensitive by default.
pred['found?'] = pred['right_context'].str.upper()\
.str.contains('|'.join(_direcciones['Address']))
print(pred)
right_context PERC found?
0 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.000197 False
1 San Pedro xxxxxxxxxxxx 0.572630 True
2 zxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.572630 False
3 de San Pedro Este parcela 0.572630 True
4 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.035577 False
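The joined string is interpreted as a regular expression, so an address containing characters such as ( or . would change the pattern, and a plain substring match would also flag San Pedrosa. A minimal sketch of one way to guard against both, assuming the address column is named Address as in the answer (the question's own code calls it entity_content) and using made-up sample rows:

```python
import re

import pandas as pd

# Toy stand-ins for the two frames in the question (values are illustrative).
pred = pd.DataFrame({
    "right_context": [
        "xxxxxxxx",
        "San Pedro xxxxxxxxxxxx",
        "zxxxxxxxx",
        "de San Pedro Este parcela",
        "San Pedrosa xxxx",
    ],
})
_direcciones = pd.DataFrame(
    {"Address": ["SAN PEDRO", "C/ MAYOR (NORTE)", "AV. SOL"]}
)

# Escape each address so "(", ")" or "." are matched literally, and drop
# duplicates to keep the pattern as small as possible.
addresses = _direcciones["Address"].dropna().astype(str).drop_duplicates()
alternatives = "|".join(re.escape(a) for a in addresses)

# (?<!\w) and (?!\w) require the match to start and end at a word edge;
# the non-capturing group (?:...) keeps the alternation together.
pattern = rf"(?<!\w)(?:{alternatives})(?!\w)"

# case=False makes the match case-insensitive, so str.upper is not needed here.
pred["found?"] = pred["right_context"].str.contains(
    pattern, case=False, regex=True, na=False
)
print(pred)
```

With case=False the uppercase conversion becomes unnecessary, and na=False makes missing values count as not found instead of propagating NaN into the new column.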
To get only T & F
pred['found?'] = pred['right_context'].str.upper()\
.str.contains('|'.join(_direcciones['Address']))\
.astype(str).str[:1]
print(pred)
right_context PERC found?
0 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.000197 F
1 San Pedro xxxxxxxxxxxx 0.572630 T
2 zxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.572630 F
3 de San Pedro Este parcela 0.572630 T
4 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.035577 F
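As an alternative to slicing the string form of the booleans, the labels can be mapped explicitly. This is only a sketch on a hypothetical boolean Series named found, standing in for the result of str.contains:

```python
import pandas as pd

# Hypothetical boolean result, standing in for the output of str.contains.
found = pd.Series([False, True, False, True, False])

# Map each boolean to a one-letter label explicitly.
labels = found.map({True: "T", False: "F"})
print(labels.tolist())
```

An explicit mapping states the intended labels directly and does not depend on how True and False happen to be spelled when converted to strings.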
Output of '|'.join
'|'.join(_direcciones['Address'])
'SAN PEDRO|bbbbbbbbbbbbbbbbbbbbbb|yyyyyyyyyyyyyyyyyyy'
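Because the joined string is an ordinary regular-expression alternation, it can also report which entry matched rather than only whether one did. That may help with the situation in the question's update, where generic entries such as URBANA make almost every row True. The sketch below assumes the column name Address and uses toy rows based on the update's example. It shows a way to inspect matches, not a fix for the ambiguity itself:

```python
import re

import pandas as pd

# Toy rows based on the example in the question's update.
pred = pd.DataFrame({
    "right_context": [
        "URBANA. OBRA NUEVA TERMINADA. Urbana",
        "San Pedro número xxxxx",
        "zzzz",
    ],
})
_direcciones = pd.DataFrame({"Address": ["SAN PEDRO", "URBANA", "yyyy"]})

# One capture group around the escaped alternation, so str.extract returns
# the first piece of text that matched (NaN when nothing matched).
pattern = "(" + "|".join(re.escape(a) for a in _direcciones["Address"]) + ")"

pred["matched"] = pred["right_context"].str.extract(
    pattern, flags=re.IGNORECASE, expand=False
)
pred["found?"] = pred["matched"].notna()
print(pred)
```

Inspecting the matched column, for example with value_counts, shows which entries in _direcciones trigger most of the hits, which is one way to decide whether overly generic entries should be excluded before searching.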