Apply multiple string containment filters to pandas dataframe using dictionary
Question
I need to filter on multiple columns based on string containment. The substrings are specified in the dict column_filters, and text case should be ignored, using toupper() or something along those lines. For example:
import pandas as pd

column_filters = {'COLUMN_1': ['drum', 'gui'], 'COLUMN_2': ['sta', 'kic']}
df = pd.DataFrame({'COLUMN_1': ['DrumSet', 'GUITAR', 'String', 'Bass', 'Violin'],
                   'COLUMN_2': ['STAND', 'DO', 'KICKSET', 'CAT', 'CELLO'],
                   'COLUMN_3': ['LOSER', 'LOVE', 'LICKING', 'STICK', 'BOLOGNA']})
DataFrame to filter based on the column_filters dict:
  COLUMN_1 COLUMN_2 COLUMN_3
0  DrumSet    STAND    LOSER
1   GUITAR       DO     LOVE
2   String  KICKSET  LICKING
3     Bass      CAT    STICK
4   Violin    CELLO  BOLOGNA
Result:
  COLUMN_1 COLUMN_2 COLUMN_3
0  DrumSet    STAND    LOSER
1   GUITAR       DO     LOVE
2   String  KICKSET  LICKING
Recommended answer
I'd convert the dict values into regex patterns by joining all the strings with '|'. You can then use str.contains to filter the df:
In [50]:
# Overwrites each list in column_filters with a single regex pattern string.
for k in column_filters.keys():
    column_filters[k] = '|'.join(column_filters[k])
column_filters
Out[50]:
{'COLUMN_1': 'drum|gui', 'COLUMN_2': 'sta|kic'}
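One caveat with joining on '|': the substrings are then interpreted as regex, so a character such as '.' or '+' would change the meaning of the pattern. A minimal sketch, assuming the substrings should be matched literally, is to pass each one through re.escape before joining. The name patterns is only illustrative, and unlike the loop above this builds a new dict instead of overwriting column_filters:

```python
import re

column_filters = {'COLUMN_1': ['drum', 'gui'], 'COLUMN_2': ['sta', 'kic']}

# Escape each substring so regex metacharacters are matched literally,
# then join the alternatives with '|'. The original lists are left intact.
patterns = {col: '|'.join(re.escape(s) for s in subs)
            for col, subs in column_filters.items()}

print(patterns)
```

For these particular substrings the result is the same as before, since none of them contains a regex metacharacter.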
Now filter using str.contains with the parameter case=False:
In [51]:
df.loc[(df['COLUMN_1'].str.contains(column_filters['COLUMN_1'], case=False)) | (df['COLUMN_2'].str.contains(column_filters['COLUMN_2'], case=False))]
Out[51]:
  COLUMN_1 COLUMN_2
0  DrumSet    STAND
1   GUITAR       DO
2   String  KICKSET
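If a filtered column can contain missing values, str.contains returns a missing value for those rows by default, and a mask with missing values is generally not usable for boolean indexing. A small sketch, assuming missing entries should simply count as non-matches, is to pass na=False. The Series below is made-up sample data, not part of the question:

```python
import pandas as pd

# Made-up sample with one missing entry.
s = pd.Series(['DrumSet', None, 'Bass'])

# na=False makes the missing row evaluate to False instead of a missing value.
mask = s.str.contains('drum|gui', case=False, na=False)

print(mask.tolist())  # [True, False, False]
```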
Update
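OK, there is a dynamic method. It assumes column_filters still holds the original lists (not the joined pattern strings) and that every column of df has an entry in the dict: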
In [68]:
df[df.apply(lambda x: x.str.contains('|'.join(column_filters[x.name]), case=False)).any(axis=1)]
Out[68]:
  COLUMN_1 COLUMN_2
0  DrumSet    STAND
1   GUITAR       DO
2   String  KICKSET
Looking at the mask before it is reduced with any(axis=1), we can see that it matches correctly:
In [69]:
df.apply(lambda x: x.str.contains('|'.join(column_filters[x.name]), case=False))
Out[69]:
  COLUMN_1 COLUMN_2
0     True     True
1     True    False
2    False     True
3    False    False
4    False    False
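The choice of any(axis=1) is what makes this an OR across columns: a row is kept if at least one column matches. As a sketch of the alternative, assuming you wanted rows where every filtered column matches, all(axis=1) could be used instead. The variable names any_rows and all_rows are illustrative, and the two-column frame mirrors the one used in the output above:

```python
import pandas as pd

column_filters = {'COLUMN_1': ['drum', 'gui'], 'COLUMN_2': ['sta', 'kic']}
df = pd.DataFrame({'COLUMN_1': ['DrumSet', 'GUITAR', 'String', 'Bass', 'Violin'],
                   'COLUMN_2': ['STAND', 'DO', 'KICKSET', 'CAT', 'CELLO']})

# One boolean column per filtered column.
mask = df.apply(lambda x: x.str.contains('|'.join(column_filters[x.name]), case=False))

any_rows = df[mask.any(axis=1)].index.tolist()  # OR: at least one column matches
all_rows = df[mask.all(axis=1)].index.tolist()  # AND: every column matches

print(any_rows)  # [0, 1, 2]
print(all_rows)  # [0]
```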
Update 2
To answer your modified question, apply the filters only to the columns named in the dict:
In [75]:
df[df[list(column_filters.keys())].apply(lambda x: x.str.contains('|'.join(column_filters[x.name]), case=False)).any(axis=1)]
Out[75]:
  COLUMN_1 COLUMN_2 COLUMN_3
0  DrumSet    STAND    LOSER
1   GUITAR       DO     LOVE
2   String  KICKSET  LICKING
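Since the question mentions normalizing case with something like toupper(), here is a rough regex-free sketch of that idea for comparison. It is not the approach used above: it lowercases each column with str.lower() and checks each substring literally with regex=False. The names mask, lowered and result are illustrative:

```python
import pandas as pd

column_filters = {'COLUMN_1': ['drum', 'gui'], 'COLUMN_2': ['sta', 'kic']}
df = pd.DataFrame({'COLUMN_1': ['DrumSet', 'GUITAR', 'String', 'Bass', 'Violin'],
                   'COLUMN_2': ['STAND', 'DO', 'KICKSET', 'CAT', 'CELLO'],
                   'COLUMN_3': ['LOSER', 'LOVE', 'LICKING', 'STICK', 'BOLOGNA']})

# Start with an all-False mask and OR in every match.
mask = pd.Series(False, index=df.index)
for col, subs in column_filters.items():
    lowered = df[col].str.lower()  # normalize case once per column
    for sub in subs:
        # regex=False treats the substring literally;
        # na=False counts missing values as non-matches.
        mask |= lowered.str.contains(sub.lower(), regex=False, na=False)

result = df[mask]
print(result)
```

This trades the single compact expression for explicit loops, which may be easier to adapt if the substrings contain characters that are special in regex.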