Filtering DataFrame by finding exact word (not combined) in a column of strings
Question
My DataFrame has two columns:
Name Status
a I am Good
b Goodness!!!
c Good is what i feel
d Not Good-at-all
I want to filter rows in which Status contains the string 'Good' as an exact word, not combined with any other words or characters.
So the output would be:
Name Status
a I am Good
c Good is what i feel
The other two rows contain the string 'Good', but mixed with other characters, so they should not be picked up.
I tried:
d = df[df['Status'].str.contains('Good')] # But all rows come up
I believe some regex like (r'\bGood\b', Status) will do that, but I am not able to put it all together. How and where exactly can I fit the regex into a DataFrame filter condition to achieve this? And how can I do the same for startswith or endswith 'Good' (exact word search)?
Recommended answer
If you're defining "exact" to mean no other characters (including punctuation, which defines a word boundary \b), you could instead check for a leading and trailing space and/or beginning/end anchors:
>>> df[df['Status'].str.contains(r'(?:\s|^)Good(?:\s|$)')]
Name Status
0 a I am Good
2 c Good is what i feel
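To reproduce this and see why \b alone is not enough here, a minimal self-contained sketch follows. It rebuilds the DataFrame from the question; the variable names word_boundary and whitespace_delimited are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Status": ["I am Good", "Goodness!!!", "Good is what i feel", "Not Good-at-all"],
})

# \b sees a word boundary between "d" and "-", so "Good-at-all" still matches.
word_boundary = df[df["Status"].str.contains(r"\bGood\b")]

# Requiring whitespace or a string anchor on both sides excludes that row.
whitespace_delimited = df[df["Status"].str.contains(r"(?:\s|^)Good(?:\s|$)")]

print(word_boundary["Name"].tolist())         # ['a', 'c', 'd']
print(whitespace_delimited["Name"].tolist())  # ['a', 'c']
```

If the column may contain missing values, passing na=False to str.contains keeps the mask purely boolean.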
Explanation:
- (?:\s|^) is a non-capturing group that matches either a whitespace character (\s) or the beginning of the string (^).
- Good is the word you're looking for.
- (?:\s|$) is a non-capturing group that matches either a whitespace character (\s) or the end of the string ($).
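For the startswith / endswith part of the question, one possible approach (a sketch, not the only way to do it) is to keep only one of the two groups and replace the other with a plain anchor. The names starts_with_good and ends_with_good are made up for this example.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Status": ["I am Good", "Goodness!!!", "Good is what i feel", "Not Good-at-all"],
})

# Starts with the standalone word: anchored at the start,
# followed by whitespace or the end of the string.
starts_with_good = df[df["Status"].str.contains(r"^Good(?:\s|$)")]

# Ends with the standalone word: preceded by whitespace or the start
# of the string, anchored at the end.
ends_with_good = df[df["Status"].str.contains(r"(?:\s|^)Good$")]

print(starts_with_good["Name"].tolist())  # ['c']
print(ends_with_good["Name"].tolist())    # ['a']
```

"Goodness!!!" is not picked up by the first pattern because the character after "Good" is neither whitespace nor the end of the string.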