Apply multiple string containment filters to a pandas DataFrame using a dictionary


Problem description

I need to filter a DataFrame on multiple columns based on string containment. The substrings to look for in each column are specified in the dict column_filters, and the match should ignore case (using toupper() or something along those lines). For example:

import pandas as pd

# Substrings to look for, keyed by column name
column_filters = {'COLUMN_1': ['drum', 'gui'], 'COLUMN_2': ['sta', 'kic']}

df = pd.DataFrame({'COLUMN_1': ['DrumSet', 'GUITAR', 'String', 'Bass', 'Violin'],
                   'COLUMN_2': ['STAND', 'DO', 'KICKSET', 'CAT', 'CELLO'],
                   'COLUMN_3': ['LOSER', 'LOVE', 'LICKING', 'STICK', 'BOLOGNA']})

DataFrame to filter based on the column_filters dict:

         COLUMN_1   COLUMN_2    COLUMN_3
      0 DrumSet      STAND       LOSER
      1 GUITAR       DO          LOVE
      2 String       KICKSET     LICKING
      3 Bass         CAT         STICK
      4 Violin       CELLO       BOLOGNA

Expected result:

    COLUMN_1    COLUMN_2     COLUMN_3
0   DrumSet      STAND       LOSER
1   GUITAR       DO          LOVE
2   String       KICKSET     LICKING

Recommended answer

I'd convert the dict values into regex patterns by joining the strings in each list with '|'; you can then use str.contains to filter the df. Note that the loop below overwrites each list in column_filters with its joined pattern string:

In [50]:
for k in column_filters.keys():
    column_filters[k] = '|'.join(column_filters[k])
column_filters

Out[50]:
{'COLUMN_1': 'drum|gui', 'COLUMN_2': 'sta|kic'}
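
Because str.contains treats the joined string as a regular expression by default, a search term containing characters such as '+' or '.' would be interpreted as regex syntax rather than literal text. A minimal sketch of one way to guard against that, assuming the terms are meant as plain substrings; the names raw_filters and patterns and the sample terms are hypothetical and not part of the original question:

```python
import re

# Hypothetical filter terms; 'c++' and 'a.b' contain regex metacharacters.
raw_filters = {'COLUMN_1': ['drum', 'c++'], 'COLUMN_2': ['a.b']}

# Escape each term so it is matched literally, then join with '|'.
# A new dict is built so the original lists are left untouched.
patterns = {col: '|'.join(re.escape(term) for term in terms)
            for col, terms in raw_filters.items()}

print(patterns)
```

Building a separate dict also keeps the original lists available for later steps that join them again.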

Now filter using str.contains with the parameter case=False, combining the two column masks with | (logical OR):

In [51]:
df.loc[(df['COLUMN_1'].str.contains(column_filters['COLUMN_1'], case=False)) | (df['COLUMN_2'].str.contains(column_filters['COLUMN_2'], case=False))]

Out[51]:
  COLUMN_1 COLUMN_2
0  DrumSet    STAND
1   GUITAR       DO
2   String  KICKSET
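
If a filtered column can contain missing values, str.contains returns a missing value for those rows instead of True or False, which is awkward to use as a boolean mask. The sketch below shows the same two-column filter with the na=False parameter of str.contains, so that missing entries count as non-matches. The small DataFrame with a NaN and the name patterns are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: same shape as the example, but with one missing value.
df = pd.DataFrame({'COLUMN_1': ['DrumSet', np.nan, 'String'],
                   'COLUMN_2': ['STAND', 'DO', 'KICKSET']})
patterns = {'COLUMN_1': 'drum|gui', 'COLUMN_2': 'sta|kic'}

# na=False makes rows with missing values evaluate to False.
mask = (df['COLUMN_1'].str.contains(patterns['COLUMN_1'], case=False, na=False)
        | df['COLUMN_2'].str.contains(patterns['COLUMN_2'], case=False, na=False))

result = df.loc[mask]
print(result)
```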

Update

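OK, there is a dynamic method that works for any number of columns. It expects column_filters to still hold the original lists (not the joined strings produced in In [50]) and every column of df to have an entry in the dict: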

In [68]:
df[df.apply(lambda x: x.str.contains('|'.join(column_filters[x.name]), case=False)).any(axis=1)]

Out[68]:
  COLUMN_1 COLUMN_2
0  DrumSet    STAND
1   GUITAR       DO
2   String  KICKSET

Looking at the output of apply on its own, before it is reduced with any(axis=1) and used as a mask, we can see that it matches correctly in each column:

In [69]:
df.apply(lambda x: x.str.contains('|'.join(column_filters[x.name]), case=False))

Out[69]:
  COLUMN_1 COLUMN_2
0     True     True
1     True    False
2    False     True
3    False    False
4    False    False
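
The per-column boolean frame above can be reduced in more than one way. As a sketch, any(axis=1) keeps rows where at least one column matches (the behaviour asked for in the question), while all(axis=1) would keep only rows where every filtered column matches. The variable names matches, any_match and all_match are illustrative, and the two-column DataFrame mirrors the one used at this point in the answer:

```python
import pandas as pd

column_filters = {'COLUMN_1': ['drum', 'gui'], 'COLUMN_2': ['sta', 'kic']}
df = pd.DataFrame({'COLUMN_1': ['DrumSet', 'GUITAR', 'String', 'Bass', 'Violin'],
                   'COLUMN_2': ['STAND', 'DO', 'KICKSET', 'CAT', 'CELLO']})

# One boolean column per filtered column.
matches = df.apply(
    lambda col: col.str.contains('|'.join(column_filters[col.name]), case=False))

any_match = df[matches.any(axis=1)]  # OR across columns
all_match = df[matches.all(axis=1)]  # AND across columns

print(any_match)
print(all_match)
```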

Update 2

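To answer your modified question, where the DataFrame has columns that are not in the dict: apply the filters only to the columns listed in the dict's keys, so that unlisted columns such as COLUMN_3 are never looked up in column_filters but are still returned in the result: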

In [75]:
df[df[list(column_filters.keys())].apply(lambda x: x.str.contains('|'.join(column_filters[x.name]), case=False)).any(axis=1)]

Out[75]:
  COLUMN_1 COLUMN_2 COLUMN_3
0  DrumSet    STAND    LOSER
1   GUITAR       DO     LOVE
2   String  KICKSET  LICKING
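
For reuse, the same idea can be wrapped in a small function that leaves the dict untouched. This is only a sketch under stated assumptions: the name filter_by_substrings is hypothetical, terms are treated as literal substrings (hence re.escape), and missing values count as non-matches (na=False). It builds the mask with an explicit loop instead of apply, which is a different formulation from the answer's one-liner but selects the same rows for this data:

```python
import re
import pandas as pd

def filter_by_substrings(df, column_filters):
    """Keep rows where any listed column contains any of its terms, ignoring case."""
    mask = pd.Series(False, index=df.index)
    for col, terms in column_filters.items():
        # Literal, case-insensitive match of any term in this column.
        pattern = '|'.join(re.escape(term) for term in terms)
        mask |= df[col].str.contains(pattern, case=False, na=False)
    return df[mask]

column_filters = {'COLUMN_1': ['drum', 'gui'], 'COLUMN_2': ['sta', 'kic']}
df = pd.DataFrame({'COLUMN_1': ['DrumSet', 'GUITAR', 'String', 'Bass', 'Violin'],
                   'COLUMN_2': ['STAND', 'DO', 'KICKSET', 'CAT', 'CELLO'],
                   'COLUMN_3': ['LOSER', 'LOVE', 'LICKING', 'STICK', 'BOLOGNA']})

filtered = filter_by_substrings(df, column_filters)
print(filtered)
```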
