How to test if a string contains one of the substrings in a list, in pandas?

Problem description

Is there any function that would be the equivalent of a combination of df.isin() and df[col].str.contains()?

For example, say I have the series s = pd.Series(['cat','hat','dog','fog','pet']) and I want to find all places where s contains any of ['og', 'at']. I would want to get everything but 'pet'.

I have a solution, but it's rather inelegant:

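searchfor = ['og', 'at']
# One boolean Series per substring
found = [s.str.contains(x) for x in searchfor]
# Stack the Series as rows, then check each column for any match
result = pd.DataFrame(found)
result.any()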

Is there a better way to do this?

Solution

One option is to use the regex | (alternation) character, so that str.contains tries to match any one of the substrings against each string in your Series s.

You can construct the regex by joining the words in searchfor with |:

>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0    cat
1    hat
2    dog
3    fog
dtype: object
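
For reference, here is a self-contained sketch of the same idea that also checks it against the loop-based approach from the question. The variable names pattern, mask, loop_mask and matched are only illustrative and do not come from the original answer.

```python
import pandas as pd

s = pd.Series(['cat', 'hat', 'dog', 'fog', 'pet'])
searchfor = ['og', 'at']

# Regex alternation: 'og|at' matches any string containing either substring.
pattern = '|'.join(searchfor)
mask = s.str.contains(pattern)

# The question's approach: one boolean Series per substring, combined with any().
loop_mask = pd.DataFrame([s.str.contains(x) for x in searchfor]).any()

# Boolean indexing keeps only the rows where the mask is True.
matched = s[mask]
print(matched.tolist())
```

Both masks should flag the same rows; the joined pattern simply avoids building the intermediate DataFrame.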

As @AndyHayden noted in the comments below, take care if your substrings have special characters such as $ and ^ which you want to match literally. These characters have specific meanings in the context of regular expressions and will affect the matching.

You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:

>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']

The strings in this new list will be matched literally, character by character, when used with str.contains.
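
Putting the two steps together, a minimal sketch might look like the following. The sample Series is hypothetical data made up for illustration, and the names escaped_mask and naive_mask are not part of the original answer.

```python
import re
import pandas as pd

# Hypothetical data: two entries contain regex metacharacters, two do not.
s = pd.Series(['$money', 'x^y', 'money', 'xy'])
matches = ['$money', 'x^y']

# Escape each substring first, then join with | to build the pattern.
pattern = '|'.join(re.escape(m) for m in matches)
escaped_mask = s.str.contains(pattern)

# Without escaping, $ and ^ act as end/start-of-string anchors,
# so neither alternative can match anything in this Series.
naive_mask = s.str.contains('|'.join(matches))

print(s[escaped_mask].tolist())
print(naive_mask.any())
```

If none of the substrings need regex features, another option is to keep the loop from the question and pass regex=False to str.contains, which treats each substring as a literal.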
