Count occurrences of a string in multiple string columns
Question
I have a dataframe called df that looks similar to this. (I have simplified it here: the real 'mat_deliv' columns go up to mat_deliv_8, there are several hundred clients, and there are a number of other columns between Client_ID and mat_deliv_1.)
Client_ID  mat_deliv_1  mat_deliv_2  mat_deliv_3  mat_deliv_4
C1019876   xxx,yyy,zzz  aaa,bbb,xxx  xxx          ddd
C1018765   yyy,zzz      xxx          xxx          None
C1017654   yyy,xxx      aaa,bbb      ccc          ddd
C1016543   aaa,bbb      ccc          None         None
C1019876   yyy          None         None         None
I want to create a new column called xxx_count that counts the number of times xxx appears in mat_deliv_1, mat_deliv_2, mat_deliv_3 and mat_deliv_4. The values should look like this:
Client_ID  mat_deliv_1  mat_deliv_2  mat_deliv_3  mat_deliv_4  xxx_count
C1019876   xxx,yyy,zzz  aaa,xxx,bbb  xxx          ddd          3
C1018765   yyy,zzz      xxx          xxx          None         2
C1017654   yyy,xxx      aaa,bbb      ccc          ddd          1
C1016543   aaa,bbb      ccc          None         None         0
C1015432   yyy          None         None         None         0
I tried the following code:
df = df.assign(xxx_count=df.loc[:, "mat_deliv_1":"mat_deliv_4"].\
apply(lambda col: col.str.count('xxx')).fillna(0).astype(int))
But it does not produce a count, only a binary variable where 0 = no cases of xxx and 1 = the presence of xxx in at least one of the four mat_deliv columns.
NB: this is a follow-up question to that asked here: Creating a column based on the presence of part of a string in multiple other columns
Recommended answer
Try joining the columns horizontally before counting:
df['counts'] = (df.loc[:, "mat_deliv_1":"mat_deliv_4"]
                  .fillna('')              # missing cells become empty strings
                  .agg(','.join, axis=1)   # one comma-joined string per row
                  .str.count('xxx'))
df
   Client_ID  mat_deliv_1  mat_deliv_2  mat_deliv_3  mat_deliv_4  counts
0  C1019876   xxx,yyy,zzz  aaa,bbb,xxx  xxx          ddd          3
1  C1018765   yyy,zzz      xxx          xxx          NaN          2
2  C1017654   yyy,xxx      aaa,bbb      ccc          ddd          1
3  C1016543   aaa,bbb      ccc          NaN          NaN          0
4  C1019876   yyy          NaN          NaN          NaN          0
This works as intended if "xxx" occurs at most once per column. If it occurs more than once in a column, str.count counts every occurrence, including matches inside longer strings.
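To illustrate the caveat above, here is a minimal sketch on made-up data (the frame `demo` and its values are hypothetical, not taken from the question). It shows that the join-and-count approach counts a repeated "xxx" twice and also matches "xxx" inside a longer item such as "xxxy":

```python
import pandas as pd

# Hypothetical data: "xxxy" merely contains "xxx", and "xxx,xxx" repeats it.
demo = pd.DataFrame({
    "mat_deliv_1": ["xxx,yyy", "xxxy", "xxx,xxx"],
    "mat_deliv_2": ["xxx", None, "aaa"],
})

# Join each row into one comma-separated string, then count pattern matches.
joined = demo.fillna("").agg(",".join, axis=1)
demo["counts"] = joined.str.count("xxx")

print(demo["counts"].tolist())  # [2, 1, 2]
```

Whether the second and third rows are "correct" depends on what should be counted; if only exact items should match, splitting on commas first is the safer route.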
Another option involves stack:
df['counts'] = (
    df.loc[:, "mat_deliv_1":"mat_deliv_4"]
      .stack()
      .str.count('xxx')
      # sum(level=0) was removed in pandas 2.0; group by the row index instead
      .groupby(level=0).sum())
df
   Client_ID  mat_deliv_1  mat_deliv_2  mat_deliv_3  mat_deliv_4  counts
0  C1019876   xxx,yyy,zzz  aaa,bbb,xxx  xxx          ddd          3
1  C1018765   yyy,zzz      xxx          xxx          NaN          2
2  C1017654   yyy,xxx      aaa,bbb      ccc          ddd          1
3  C1016543   aaa,bbb      ccc          NaN          NaN          0
4  C1019876   yyy          NaN          NaN          NaN          0
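For anyone who wants to run the stack approach end to end, the following is a self-contained sketch that rebuilds the sample frame from the question. Two things here are my own assumptions rather than part of the answer: the extra last row (`C0000000`, all missing) is hypothetical, and the `reindex(..., fill_value=0)` step is added because, depending on the pandas version, stack may drop missing cells, which would leave such a row without a count. The sketch sums per row with `groupby(level=0).sum()`, which also works on pandas versions where `sum(level=0)` is no longer available.

```python
import pandas as pd

df = pd.DataFrame({
    # The last row is hypothetical: a client with no deliveries at all.
    "Client_ID": ["C1019876", "C1018765", "C1017654", "C1016543", "C1019876", "C0000000"],
    "mat_deliv_1": ["xxx,yyy,zzz", "yyy,zzz", "yyy,xxx", "aaa,bbb", "yyy", None],
    "mat_deliv_2": ["aaa,bbb,xxx", "xxx", "aaa,bbb", "ccc", None, None],
    "mat_deliv_3": ["xxx", "xxx", "ccc", None, None, None],
    "mat_deliv_4": ["ddd", None, "ddd", None, None, None],
})

per_row = (
    df.loc[:, "mat_deliv_1":"mat_deliv_4"]
      .stack()                    # long format: one entry per (row, column) cell
      .str.count("xxx")           # matches per cell
      .groupby(level=0).sum()     # add the cells back up per original row
)

# Rows that vanished during stacking (all cells missing) get a count of 0.
df["counts"] = per_row.reindex(df.index, fill_value=0).astype(int)

print(df["counts"].tolist())  # [3, 2, 1, 0, 0, 0]
```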
This can easily be modified to count just the first occurrence in each column, using str.contains:
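df['counts'] = (
    df.loc[:, "mat_deliv_1":"mat_deliv_4"]
      .stack()
      .str.contains('xxx')        # True/False per cell
      # sum(level=0) was removed in pandas 2.0; group by the row index instead
      .groupby(level=0).sum())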
If it is possible for "xxx" to be a substring of another item, first split and then count:
df['counts'] = (df.loc[:, "mat_deliv_1":"mat_deliv_4"]
                  .stack()
                  .str.split(',', expand=True)
                  .eq('xxx')
                  .any(axis=1)  # change to `.sum(axis=1)` to count all occurrences
                  # sum(level=0) was removed in pandas 2.0; group by the row index instead
                  .groupby(level=0).sum())
For performance, use a list comprehension:
df['counts'] = [
    ','.join(x).count('xxx')
    for x in df.loc[:, "mat_deliv_1":"mat_deliv_4"].fillna('').values
]
df
   Client_ID  mat_deliv_1  mat_deliv_2  mat_deliv_3  mat_deliv_4  counts
0  C1019876   xxx,yyy,zzz  aaa,bbb,xxx  xxx          ddd          3
1  C1018765   yyy,zzz      xxx          xxx          NaN          2
2  C1017654   yyy,xxx      aaa,bbb      ccc          ddd          1
3  C1016543   aaa,bbb      ccc          NaN          NaN          0
4  C1019876   yyy          NaN          NaN          NaN          0
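The list comprehension above still counts substrings. If exact items are what matter, one possible variation is to split each cell inside the loop. This is only a sketch: the helper `count_token` and the small frame below are hypothetical names and data introduced for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical data: "xxxy" should not be counted as "xxx".
df = pd.DataFrame({
    "mat_deliv_1": ["xxx,yyy,zzz", "xxxy,zzz", None],
    "mat_deliv_2": ["aaa,bbb,xxx", "xxx", None],
})

def count_token(row, token="xxx"):
    """Count comma-separated items in `row` that equal `token` exactly."""
    return sum(
        item.strip() == token
        for cell in row
        if isinstance(cell, str)      # skip None / NaN cells
        for item in cell.split(",")
    )

df["counts"] = [
    count_token(row)
    for row in df.loc[:, "mat_deliv_1":"mat_deliv_2"].to_numpy()
]

print(df["counts"].tolist())  # [2, 1, 0]
```

Because missing cells are skipped with an `isinstance` check, no `fillna('')` is needed in this variant.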
Why is a loop faster than using str methods or apply? See "For loops with pandas - When should I care?".