Pandas dataframe: count number of string value is in row for specific ID
Problem description
I have the following use case:
I want a dataframe with a column that shows, for each row, how many interactions there have been for this ID (user) in that row's categories. The hardest part for me is that matches must not be double counted: a match in just one of the categories is enough to count as 1.
So for example I have:
               richtingen    id
0        Marketing, Sales  1110
1        Marketing, Sales  1110
2                 Finance   220
3  Marketing, Engineering  1110
4                      IT  3300
Now I want to create a third column where I can see how many times this ID has interacted with any of these categories in total. Each comma-separated item is a category of its own, so for example "Marketing, Sales" means the two categories Marketing and Sales. To get a +1 you only need a match with another row where the ID is the same and at least one of the categories matches. For index 0, for example, the result would be 3 (indexes 0, 1 and 3 match). The output data for the example should be:
               richtingen    id  freq
0        Marketing, Sales  1110     3
1        Marketing, Sales  1110     3
2                 Finance   220     1
3  Marketing, Engineering  1110     3
4                      IT  3300     1
The hard part for me seems to be that I can't simply split all categories into new rows, because then I would start double counting. For example, index 0 matches both Marketing and Sales of index 1, and I want that to add just 1, not 2.
The code I have so far is:
df['freq'] = df.groupby(['id', 'richtingen'])['id'].transform('count')
This only matches identical combinations of categories, though.
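To make the limitation concrete, here is a minimal sketch. The frame construction is an assumption rebuilt from the example table above; only the groupby line comes from the attempt itself.

```python
import pandas as pd

# Assumed reconstruction of the example data shown in the question.
df = pd.DataFrame({
    "richtingen": ["Marketing, Sales", "Marketing, Sales", "Finance",
                   "Marketing, Engineering", "IT"],
    "id": [1110, 1110, 220, 1110, 3300],
})

# Count rows per exact (id, richtingen) pair: the whole string must be equal.
df["freq"] = df.groupby(["id", "richtingen"])["id"].transform("count")
print(df["freq"].tolist())  # [2, 2, 1, 1, 1]
```

Index 3 gets 1 rather than the desired 3 because "Marketing, Engineering" is compared as a whole string and so never equals "Marketing, Sales".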
Other things I've tried:
- Creating a new column with all vacancies split into an array:
df['splitted'] = df.richtingen.apply(lambda x: str(x.split(",")))
The plan was then to use something along the lines of this code, in combination with a groupby on id, to count the number of times it is true per item:
if any(t < 0 for t in x):
    pass  # do something
I couldn't get this to work either.
- I tried splitting the categories into new rows or columns, but then ran into the double counting issue.
For example using code suggested:
df['richtingen'].str.split(', ',expand=True)
Gives me the following:
           0            1    id
0  Marketing        Sales  1110
1  Marketing        Sales  1110
2        dDD         None   220
3  Marketing  Engineering  1110
4      ddsad         None  3300
But then I would need code that goes over every row, checks the ID, lists the values in the columns, checks whether any of them appear in the other rows with the same ID, and adds 1 to freq when at least one matches. I suspect this might be possible with groupby, but I am not sure and can't figure it out.
- (Solution suggested by Jezrael below): If you need to count the unique categories per id, first split, then create a MultiIndex Series with stack, and finally use SeriesGroupBy.nunique with map for the new column of the original DataFrame.
I think the solution is probably something similar to this, but at the moment it counts the total number of unique categories (not the number of unique interactions with categories). For example, the output at index 2 here is 2, while it should be 1 (as the user only interacted with the categories once).
               richtingen    id  freq
0        Marketing, Sales  1110     3
1        Marketing, Sales  1110     3
2     Finance, Accounting   220     2
3  Marketing, Engineering  1110     3
4                      IT  3300     1
I hope I made myself clear and that someone knows how to fix this! In total there will be around 13 categories, always in one cell, but separated by commas.
For msr_003:
     id        richtingen  freq_x  freq_y
0   220       Finance, IT       0       2
1  1110       Finance, IT       1       2
2  1110  Marketing, Sales       2       4
3  1110  Marketing, Sales       3       4
4   220         Marketing       4       1
5   220           Finance       5       2
6  1110  Marketing, Sales       6       4
7  3300                IT       7       1
8  1110     Marketing, IT       8       4
If you need to count the unique categories per id, first split, then create a MultiIndex Series with stack, and finally use SeriesGroupBy.nunique with map to build the new column of the original DataFrame:
s = (df.set_index('id')['richtingen']
       .str.split(', ', expand=True)
       .stack()
       .groupby(level=0)
       .nunique())
print(s)
id
220     1
1110    3
3300    1
dtype: int64
df['freq'] = df['id'].map(s)
print(df)
               richtingen    id  freq
0        Marketing, Sales  1110     3
1        Marketing, Sales  1110     3
2                 Finance   220     1
3  Marketing, Engineering  1110     3
4                      IT  3300     1
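The same per-id count of unique categories can probably also be written with DataFrame.explode instead of stack. This is only a sketch: it assumes pandas 0.25 or later (where explode is available), the frame is rebuilt from the question's example, and the helper column name `category` is made up for illustration.

```python
import pandas as pd

# Assumed reconstruction of the example data from the question.
df = pd.DataFrame({
    "richtingen": ["Marketing, Sales", "Marketing, Sales", "Finance",
                   "Marketing, Engineering", "IT"],
    "id": [1110, 1110, 220, 1110, 3300],
})

# One row per (id, category), then count distinct categories per id.
s = (df.assign(category=df["richtingen"].str.split(", "))
       .explode("category")
       .groupby("id")["category"]
       .nunique())

# Map the per-id counts back onto the original rows.
df["freq"] = df["id"].map(s)
print(df["freq"].tolist())  # [3, 3, 1, 3, 1]
```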
Detail:
print(df.set_index('id')['richtingen'].str.split(', ', expand=True).stack())
id
1110  0      Marketing
      1          Sales
      0      Marketing
      1          Sales
220   0        Finance
1110  0      Marketing
      1    Engineering
3300  0             IT
dtype: object
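As the question's edit points out, nunique counts distinct categories per id, which is not the same as counting rows that share a category. The following is a minimal sketch of the row-overlap count the question describes. It is not part of the original answer: the name `cats` and the pairwise comparison are assumptions for illustration, and the data is the question's "Finance, Accounting" variant.

```python
import pandas as pd

# Assumed reconstruction of the question's second example.
df = pd.DataFrame({
    "richtingen": ["Marketing, Sales", "Marketing, Sales",
                   "Finance, Accounting", "Marketing, Engineering", "IT"],
    "id": [1110, 1110, 220, 1110, 3300],
})

# Turn every cell into a set of category names.
cats = df["richtingen"].str.split(", ").apply(set)

# For each row, count the rows with the same id (itself included) whose
# category set shares at least one element with this row's set.
# Each matching row adds exactly 1, however many categories overlap.
df["freq"] = [
    sum(bool(own & other) for other in cats[df["id"] == row_id])
    for own, row_id in zip(cats, df["id"])
]
print(df["freq"].tolist())  # [3, 3, 1, 3, 1]
```

Index 2 now gets 1 instead of 2, because only one row with id 220 exists. The comparison is quadratic in the number of rows per id, which should be acceptable for modest group sizes but may need a different approach for very large groups.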