Count occurrences of a string in multiple string columns

Question

I have a dataframe called df that looks similar to this (except the number of 'mat_deliv' columns goes up to mat_deliv_8, there are several hundred clients and a number of other columns between Client_ID and mat_deliv_1 - I have simplified it here).

Client_ID  mat_deliv_1  mat_deliv_2  mat_deliv_3  mat_deliv_4
C1019876   xxx,yyy,zzz  aaa,bbb,xxx  xxx          ddd
C1018765   yyy,zzz      xxx          xxx          None
C1017654   yyy,xxx      aaa,bbb      ccc          ddd
C1016543   aaa,bbb      ccc          None         None
C1019876   yyy          None         None         None

I want to create a new column called xxx_count which counts the number of times xxx appears in mat_deliv_1, mat_deliv_2, mat_deliv_3 and mat_deliv_4. The values should look like this:

Client_ID  mat_deliv_1  mat_deliv_2  mat_deliv_3  mat_deliv_4  xxx_count
C1019876   xxx,yyy,zzz  aaa,xxx,bbb  xxx          ddd          3
C1018765   yyy,zzz      xxx          xxx          None         2
C1017654   yyy,xxx      aaa,bbb      ccc          ddd          1
C1016543   aaa,bbb      ccc          None         None         0
C1015432   yyy          None         None         None         0

I have tried the following code:

df = df.assign(xxx_count=df.loc[:, "mat_deliv_1":"mat_deliv_4"].\
               apply(lambda col: col.str.count('xxx')).fillna(0).astype(int))

But it does not produce a count, only a binary variable where 0 = no cases of xxx and 1 = the presence of xxx in at least one of the four mat_deliv columns.

NB: this is a follow-up question to that asked here: Creating a column based on the presence of part of a string in multiple other columns

Recommended answer

Try joining them horizontally before counting?

df['counts'] = (df.loc[:, "mat_deliv_1":"mat_deliv_4"]
                  .fillna('')
                  .agg(','.join, axis=1)
                  .str.count('xxx'))
df
  Client_ID  mat_deliv_1  mat_deliv_2 mat_deliv_3 mat_deliv_4  counts
0  C1019876  xxx,yyy,zzz  aaa,bbb,xxx         xxx         ddd       3
1  C1018765      yyy,zzz          xxx         xxx         NaN       2
2  C1017654      yyy,xxx      aaa,bbb         ccc         ddd       1
3  C1016543      aaa,bbb          ccc         NaN         NaN       0
4  C1019876          yyy          NaN         NaN         NaN       0

This works as intended if "xxx" occurs at most once per column. If it occurs more than once in a column, each occurrence is counted.
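For anyone who wants to reproduce this, here is a minimal self-contained sketch. The frame is rebuilt by hand from the sample in the question, and the names `cols`, `joined_counts` and `per_column_counts` are only illustrative. It also shows a per-column variant (count in each column, then sum across the row), which should give the same totals on this data.

```python
import pandas as pd

# Sample data rebuilt from the question; None marks a missing delivery.
df = pd.DataFrame({
    "Client_ID": ["C1019876", "C1018765", "C1017654", "C1016543", "C1019876"],
    "mat_deliv_1": ["xxx,yyy,zzz", "yyy,zzz", "yyy,xxx", "aaa,bbb", "yyy"],
    "mat_deliv_2": ["aaa,bbb,xxx", "xxx", "aaa,bbb", "ccc", None],
    "mat_deliv_3": ["xxx", "xxx", "ccc", None, None],
    "mat_deliv_4": ["ddd", None, "ddd", None, None],
})

cols = df.loc[:, "mat_deliv_1":"mat_deliv_4"]

# Join each row into one string, then count the substring once.
joined_counts = cols.fillna("").agg(",".join, axis=1).str.count("xxx")

# Variant: count inside every column, then add the counts across each row.
per_column_counts = (
    cols.apply(lambda col: col.str.count("xxx"))
        .sum(axis=1)      # missing values are skipped by default
        .astype(int)
)

df["xxx_count"] = joined_counts
print(df)
```

The per-column variant is close to the attempt in the question; the missing piece there was the row-wise `sum(axis=1)`.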

Another option involves stack:

# Series.sum(level=0) was removed in pandas 2.0; group on the row index instead.
df['counts'] = (
    df.loc[:, "mat_deliv_1":"mat_deliv_4"].stack().str.count('xxx').groupby(level=0).sum())
df
  Client_ID  mat_deliv_1  mat_deliv_2 mat_deliv_3 mat_deliv_4  counts
0  C1019876  xxx,yyy,zzz  aaa,bbb,xxx         xxx         ddd       3
1  C1018765      yyy,zzz          xxx         xxx         NaN       2
2  C1017654      yyy,xxx      aaa,bbb         ccc         ddd       1
3  C1016543      aaa,bbb          ccc         NaN         NaN       0
4  C1019876          yyy          NaN         NaN         NaN       0

This can easily be modified to count just the first occurrence, using str.contains:

df['counts'] = (
    df.loc[:, "mat_deliv_1":"mat_deliv_4"].stack().str.contains('xxx').groupby(level=0).sum())

If it is possible for "xxx" to be a substring, first split and then count:

df['counts'] = (df.loc[:, "mat_deliv_1":"mat_deliv_4"]
                  .stack()
                  .str.split(',', expand=True)
                  .eq('xxx')
                  .any(axis=1)  # change to `.sum(axis=1)` to count all occurrences
                  .groupby(level=0)  # replaces `.sum(level=0)`, removed in pandas 2.0
                  .sum())
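To see why the substring case matters, here is a small sketch with made-up tokens ("xxxy" and "axxx" are hypothetical and not from the question). It uses a plain Python helper, `count_exact`, rather than the stack-based chain above, so treat it as one possible way to do exact matching, not the only one.

```python
import pandas as pd

# Hypothetical data: "xxxy" and "axxx" merely contain "xxx".
df = pd.DataFrame({
    "mat_deliv_1": ["xxx,xxxy", "yyy"],
    "mat_deliv_2": ["axxx", None],
})

# Substring counting also picks up "xxxy" and "axxx".
naive = df.fillna("").agg(",".join, axis=1).str.count("xxx")

def count_exact(row, target="xxx"):
    """Count comma-separated tokens in a row that equal `target` exactly."""
    return sum(
        token == target
        for cell in row
        if isinstance(cell, str)      # skip None / NaN cells
        for token in cell.split(",")
    )

exact = df.apply(count_exact, axis=1)

print(naive.tolist())
print(exact.tolist())
```

On this data the substring count over-reports the first row, while the exact-token count only sees the single standalone "xxx".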


For performance, use a list comprehension:

df['counts'] = [
    ','.join(x).count('xxx') 
    for x in df.loc[:, "mat_deliv_1":"mat_deliv_4"].fillna('').values
]
df
  Client_ID  mat_deliv_1  mat_deliv_2 mat_deliv_3 mat_deliv_4  counts
0  C1019876  xxx,yyy,zzz  aaa,bbb,xxx         xxx         ddd       3
1  C1018765      yyy,zzz          xxx         xxx         NaN       2
2  C1017654      yyy,xxx      aaa,bbb         ccc         ddd       1
3  C1016543      aaa,bbb          ccc         NaN         NaN       0
4  C1019876          yyy          NaN         NaN         NaN       0

Why is a loop faster than using str methods or apply? See For loops with pandas - When should I care?.
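If you want to check the performance claim on your own data, a rough sketch like the following may help. The frame size, the repeat count and the function names `with_str_methods` and `with_list_comprehension` are arbitrary choices for illustration, and the timings will vary with the machine and the pandas version, so no numbers are quoted here.

```python
import timeit
import pandas as pd

base = pd.DataFrame({
    "mat_deliv_1": ["xxx,yyy,zzz", "yyy,zzz", "yyy,xxx", "aaa,bbb", "yyy"],
    "mat_deliv_2": ["aaa,bbb,xxx", "xxx", "aaa,bbb", "ccc", None],
})
# Repeat the sample rows to get a frame large enough to time (1,000 rows).
big = pd.concat([base] * 200, ignore_index=True)

def with_str_methods(frame):
    # Row-wise join via agg, then the vectorised string method.
    return frame.fillna("").agg(",".join, axis=1).str.count("xxx").tolist()

def with_list_comprehension(frame):
    # Plain Python loop over the underlying array of rows.
    return [",".join(row).count("xxx") for row in frame.fillna("").to_numpy()]

# Both approaches should agree before comparing their speed.
assert with_str_methods(big) == with_list_comprehension(big)

for fn in (with_str_methods, with_list_comprehension):
    seconds = timeit.timeit(lambda: fn(big), number=3)
    print(f"{fn.__name__}: {seconds:.3f}s for 3 runs")
```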
