Check if multiple substrings are in pandas dataframe

Question

I have a pandas dataframe and want to check one of its columns for certain substrings. At the moment I have 30 lines of code of this kind:

(df['NAME'].str.upper().str.contains('LIMITED')) |
(df['NAME'].str.upper().str.contains('INC')) |
(df['NAME'].str.upper().str.contains('CORP'))

They are all linked with an or condition and if any of them is true, the name is the name of a company rather than a person.

But to me this doesn't seem very elegant. Is there a way to check whether the strings in a pandas column contain any of the substrings in a list such as ['LIMITED', 'INC', 'CORP']?

I found the pandas.DataFrame.isin function, but it only matches entire strings, not substrings.

Recommended answer

You can use regex, where '|' is an "or" in regular expressions:

l = ['LIMITED','INC','CORP']  
regstr = '|'.join(l)
df['NAME'].str.upper().str.contains(regstr)
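
The pattern above treats every list entry as a regular expression and uppercases the column before matching. A minimal sketch of a slightly more defensive variant follows, assuming the substrings should be matched literally and that the column may contain missing values. The sample data and the names `substrings`, `pattern` and `is_company` are illustrative and not part of the original question.

```python
import re

import pandas as pd

# Hypothetical sample data; None stands in for a missing name.
df = pd.DataFrame({'NAME': ['Baby Corp.', 'Baby', 'Baby inc.', None]})

substrings = ['LIMITED', 'INC', 'CORP']

# re.escape keeps characters such as '.' or '+' inside a substring from
# being interpreted as regex syntax.
pattern = '|'.join(re.escape(s) for s in substrings)

# case=False makes the match case-insensitive, so .str.upper() is not needed;
# na=False reports missing values as False instead of NaN.
is_company = df['NAME'].str.contains(pattern, case=False, na=False)
print(is_company)
```

Because the result is a boolean Series, it can be used directly as a mask, for example `df[is_company]`.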

MCVE:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'NAME':['Baby CORP.','Baby','Baby INC.','Baby LIMITED']})

In [3]: df
Out[3]: 
           NAME
0    Baby CORP.
1          Baby
2     Baby INC.
3  Baby LIMITED

In [4]: l = ['LIMITED','INC','CORP']  
   ...: regstr = '|'.join(l)
   ...: df['NAME'].str.upper().str.contains(regstr)
   ...: 
Out[4]: 
0     True
1    False
2     True
3     True
Name: NAME, dtype: bool

In [5]: regstr
Out[5]: 'LIMITED|INC|CORP'
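
One caveat of plain substring matching is that 'INC' also occurs inside ordinary personal names such as 'VINCENT'. The sketch below assumes the keywords should only count when they appear as whole words. The sample names and the variables `keywords`, `bounded`, `plain` and `whole_word` are made up for illustration.

```python
import re

import pandas as pd

# Hypothetical sample data: the second entry is a person, not a company.
df = pd.DataFrame({'NAME': ['Baby INC.', 'Vincent Baby', 'Baby Corp']})

keywords = ['LIMITED', 'INC', 'CORP']

# Plain alternation: matches the keywords anywhere, even inside other words.
plain = df['NAME'].str.contains('|'.join(keywords), case=False)

# \b anchors the alternation at word boundaries. The group is non-capturing
# (?:...) so that pandas does not warn about match groups in the pattern.
bounded = r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b'
whole_word = df['NAME'].str.contains(bounded, case=False)

print(pd.DataFrame({'NAME': df['NAME'], 'plain': plain, 'whole_word': whole_word}))
```

With this data, the plain pattern flags 'Vincent Baby' as a company while the word-boundary version does not. Whether whole-word matching is appropriate depends on how the company suffixes actually appear in the data.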
