Pandas dataframe: count the number of rows in which a string value occurs for a specific ID


Problem description

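I have the following use case:

I want to create a dataframe in which each row has a column showing how many interactions there were for that ID (user) in the row's categories. The hard part for me is that matches must not be double counted: a match on just one of the categories is enough to count as 1.

For example, I have: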

   richtingen             id   
0  Marketing, Sales       1110 
1  Marketing, Sales       1110 
2  Finance                220  
3  Marketing, Engineering 1110 
4  IT                     3300 

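Now I would like to create a third column showing how many interactions that ID had in total with any of these categories. Every comma separates categories, so "Marketing, Sales" is the two categories Marketing and Sales. To get +1, a row only needs to match another row that has the same ID and shares at least one category. For index 0 the count is 3 (indexes 0, 1 and 3 match). The output for the example should be: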

   richtingen             id   freq
0  Marketing, Sales       1110 3
1  Marketing, Sales       1110 3
2  Finance                220  1
3  Marketing, Engineering 1110 3
4  IT                     3300 1

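The hard part for me seems to be that I can't simply split all categories into new rows, because then matches may be counted twice. For example, index 0 matches both Marketing and Sales of index 1, and I want that to add just 1, not 2.

The code I have so far is: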

df['freq'] = df.groupby(['id', 'richtingen'])['id'].transform('count')

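This only matches identical combinations of categories, though.

Other things I've tried:

  • Creating a new column with all vacancies split into an array: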

df['splitted'] = df.richtingen.apply(lambda x: str(x.split(",")))

The plan was then to use something along the lines of this code, combined with a groupby on id, to count the number of times it is true per item:

   if any(t < 0 for t in x):
   # do something

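I couldn't get this to work either.

  • I tried splitting the categories into new rows or columns, but then ran into the double-counting issue.

For example, using the suggested code: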

 df['richtingen'].str.split(', ',expand=True)

gives me the following:

           0             1       id
    0  Marketing         Sales  1110
    1  Marketing         Sales  1110
    2        dDD          None   220
    3  Marketing   Engineering  1110
    4      ddsad          None  3300

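But then I would need code that goes over every row, checks the ID, lists the values in the columns, checks whether they occur in the columns of any other row (where the ID is the same), and adds 1 to freq if one of them matches. I suspect this might be possible with groupby, but I am not sure and can't figure it out.

  • (Solution suggested by Jezrael below): If you need to count the unique categories per id, first split, create a MultiIndex Series with stack, and finally use SeriesGroupBy.nunique with map for a new column of the original DataFrame.

I think the solution may be something similar to this, but at the moment it counts the total number of unique categories (not the number of unique interactions with categories). For example, the output at index 2 here is 2, while it should be 1 (as the user only interacted with the categories once).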

    richtingen              id     freq
 0  Marketing, Sales        1110     3
 1  Marketing, Sales        1110     3
 2  Finance, Accounting     220      2
 3  Marketing, Engineering  1110     3
 4  IT                      3300     1

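Hope I made myself clear and that someone knows how to fix this! In total there will be around 13 categories, always in one cell, but separated by commas.

For msr_003: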

     id        richtingen  freq_x  freq_y
0   220       Finance, IT       0       2
1  1110       Finance, IT       1       2
2  1110  Marketing, Sales       2       4
3  1110  Marketing, Sales       3       4
4   220         Marketing       4       1
5   220           Finance       5       2
6  1110  Marketing, Sales       6       4
7  3300                IT       7       1
8  1110     Marketing, IT       8       4

Solution

If you need to count the unique categories per id, first split the strings, then create a MultiIndex Series with stack, and finally use SeriesGroupBy.nunique with map to build the new column of the original DataFrame:

s = (df.set_index('id')['richtingen']
       .str.split(', ', expand=True)
       .stack()
       .groupby(level=0)
       .nunique())
print(s)
id
220     1
1110    3
3300    1
dtype: int64

df['freq'] = df['id'].map(s)
print (df)
               richtingen    id  freq
0        Marketing, Sales  1110     3
1        Marketing, Sales  1110     3
2                 Finance   220     1
3  Marketing, Engineering  1110     3
4                      IT  3300     1
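
To make clear what the split / stack / nunique chain counts, here is a minimal plain-Python sketch of the same idea: collect each id's categories into a set and take its size. The function name `unique_categories_per_id` and the use of plain lists in place of the DataFrame columns are assumptions made for illustration only.

```python
from collections import defaultdict

def unique_categories_per_id(ids, category_strings):
    """Return {id: number of distinct categories across all rows of that id}."""
    seen = defaultdict(set)
    for row_id, cats in zip(ids, category_strings):
        # Split the comma-separated cell and drop surrounding whitespace.
        seen[row_id].update(part.strip() for part in cats.split(","))
    return {row_id: len(cats) for row_id, cats in seen.items()}

# Sample data copied from the question.
ids = [1110, 1110, 220, 1110, 3300]
richtingen = ["Marketing, Sales", "Marketing, Sales", "Finance",
              "Marketing, Engineering", "IT"]

counts = unique_categories_per_id(ids, richtingen)
print(counts)  # {1110: 3, 220: 1, 3300: 1}
```

Because this counts distinct categories rather than rows, a single row such as "Finance, Accounting" would yield 2 for its id, which is the behaviour the question points out for index 2.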

Detail:

print (df.set_index('id')['richtingen'].str.split(', ',expand=True).stack())
id     
1110  0      Marketing
      1          Sales
      0      Marketing
      1          Sales
220   0        Finance
1110  0      Marketing
      1    Engineering
3300  0             IT
dtype: object
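
The question also describes a different count: for each row, the number of rows with the same id that share at least one category, where a shared row adds 1 no matter how many categories overlap. The following is a rough sketch of that reading, not part of the answer above. The function name `overlap_counts` is hypothetical, plain lists stand in for the DataFrame columns, and each row is assumed to count itself.

```python
from collections import defaultdict

def overlap_counts(ids, category_strings):
    """For each row, count the rows with the same id that share at least
    one category with it (the row itself included)."""
    # One set of categories per row, so overlap is a set intersection.
    cat_sets = [frozenset(p.strip() for p in s.split(","))
                for s in category_strings]

    # Group row positions by id.
    rows_by_id = defaultdict(list)
    for pos, row_id in enumerate(ids):
        rows_by_id[row_id].append(pos)

    result = [0] * len(cat_sets)
    for positions in rows_by_id.values():
        for i in positions:
            # A row adds exactly 1 if the intersection is non-empty.
            result[i] = sum(1 for j in positions if cat_sets[i] & cat_sets[j])
    return result

# Sample data copied from the question.
ids = [1110, 1110, 220, 1110, 3300]
richtingen = ["Marketing, Sales", "Marketing, Sales", "Finance",
              "Marketing, Engineering", "IT"]

print(overlap_counts(ids, richtingen))  # [3, 3, 1, 3, 1]
```

Assuming `df` is the DataFrame from the question, the result could be attached with `df['freq'] = overlap_counts(df['id'], df['richtingen'])`. The comparison is quadratic in the number of rows per id, which should be acceptable for modest group sizes.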
