How to test if a string contains one of the substrings in a list, in pandas?
Question
Is there any function that would be the equivalent of a combination of df.isin() and df[col].str.contains()?
For example, say I have the series s = pd.Series(['cat','hat','dog','fog','pet']) and I want to find all places where s contains any of ['og', 'at']. I would want to get everything but 'pet'.
I have a solution, but it's rather inelegant:
searchfor = ['og', 'at']
found = [s.str.contains(x) for x in searchfor]
result = pd.DataFrame(found)
result.any()
Is there a better way to do this?
Recommended answer
One option is to use the regex | character to try to match each of the substrings in the words in your Series s (still using str.contains).
You can construct the regex by joining the words in searchfor with |:
>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0    cat
1    hat
2    dog
3    fog
dtype: object
As @AndyHayden noted in the comments below, take care if your substrings contain special characters such as $ and ^ that you want to match literally. These characters have specific meanings in regular expressions and will affect the matching.
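To illustrate the pitfall, here is a minimal sketch using a made-up Series (the sample values are assumptions, not from the question). The unescaped pattern '$money' finds nothing because $ anchors the end of the string. The regex=False argument of str.contains treats the pattern as a plain substring, though it only helps when searching for a single substring rather than a list.

```python
import pandas as pd

# Hypothetical sample data, chosen to include regex special characters.
s = pd.Series(['$money', 'money', 'x^y', 'xy'])

# '$' anchors the end of the string, so the regex '$money' can never match.
unescaped = s.str.contains('$money')
print(unescaped.tolist())

# With regex=False the pattern is matched as a literal substring instead.
literal = s.str.contains('$money', regex=False)
print(literal.tolist())
```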
You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:
>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']
The strings in this new list will match each character literally when used with str.contains.
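Putting the two steps together, escaping and joining can be done in one expression. This is only a sketch; the Series contents and the searchfor list below are made up for illustration.

```python
import re
import pandas as pd

# Hypothetical sample data mixing special characters and plain words.
s = pd.Series(['$money', 'x^y', 'cat', 'dog'])
searchfor = ['$money', 'x^y', 'og']

# Escape each substring, then join with '|' to build one alternation pattern.
pattern = '|'.join(re.escape(sub) for sub in searchfor)

# Boolean mask: True where any of the substrings occurs literally.
mask = s.str.contains(pattern)
result = s[mask]
print(result.tolist())
```

Here 'cat' is the only entry dropped, since it contains none of the three substrings.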