Vectorized way to count occurrences of string in either of two columns

Problem description

I have a problem that is similar to this question, but just different enough that it can't be solved with the same solution...

I've got two dataframes, df1 and df2, like this:

import pandas as pd
import numpy as np
np.random.seed(42)
names = ['jack', 'jill', 'jane', 'joe', 'ben', 'beatrice']
df1 = pd.DataFrame({'ID_a': np.random.choice(names, 20), 'ID_b': np.random.choice(names, 20)})
df2 = pd.DataFrame({'ID':names})

>>> df1
        ID_a      ID_b
0        joe       ben
1        ben      jack
2       jane       joe
3        ben      jill
4        ben  beatrice
5       jill       ben
6       jane       joe
7       jane      jack
8       jane      jack
9        ben      jane
10       joe      jane
11      jane      jill
12  beatrice       joe
13       ben       joe
14      jill  beatrice
15       joe  beatrice
16  beatrice  beatrice
17  beatrice      jane
18      jill       joe
19       joe       joe

>>> df2
         ID
0      jack
1      jill
2      jane
3       joe
4       ben
5  beatrice

What I'd like to do is add in a column to df2, with the count of rows in df1 where the given name can be found in either column ID_a or ID_b, resulting in this:

>>> df2
         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6

This loop gets what I need, but is inefficient for large dataframes, and if someone could suggest an alternative, nicer solution, I'd be very grateful:

df2['count'] = 0

for idx,row in df2.iterrows():
    df2.loc[idx, 'count'] = len(df1[(df1.ID_a == row.ID) | (df1.ID_b == row.ID)])

Thanks in advance!

Recommended answer

The "either" part complicates things, but should still be doable.
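To see why "either" matters, here is a minimal sketch on a small hand-made frame (the `demo` data is hypothetical, not the question's df1). Pooling both columns and counting every occurrence double-counts any row where the same name sits in both ID_a and ID_b, whereas the question asks for each row to count at most once per name.

```python
import pandas as pd

# Hypothetical toy data: row 0 has the same name in both columns.
demo = pd.DataFrame({
    'ID_a': ['joe', 'ben', 'joe'],
    'ID_b': ['joe', 'joe', 'jill'],
})

# Naive approach: pool both columns and count every occurrence.
naive = pd.concat([demo['ID_a'], demo['ID_b']]).value_counts()

# Row-wise approach: a row counts once if the name is in either column.
per_row = pd.Series({
    name: int(((demo['ID_a'] == name) | (demo['ID_b'] == name)).sum())
    for name in ['joe', 'ben', 'jill']
})

print(naive['joe'])    # 4, because row 0 contributes twice
print(per_row['joe'])  # 3, because joe appears in three rows
```

Each option below is a different way of removing that within-row duplication before counting.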

Option 1
Since other users decided to turn this into a speed-race, here's mine:

from collections import Counter
from itertools import chain

c = Counter(chain.from_iterable(set(x) for x in df1.values.tolist()))
df2['count'] = df2['ID'].map(c)  # c is already a Counter; names absent from df1 map to 0
df2

         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6

176 µs ± 7.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Option 2
(Original answer) stack based

# Series.count(level=...) was removed in pandas 2.0; group by the level instead
c = df1.stack().groupby(level=0).value_counts().groupby(level=1).count()

Or,

c = df1.stack().reset_index(level=0).drop_duplicates()[0].value_counts()

Or,

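v = df1.stack()
# Series.count(level=...) was removed in pandas 2.0; group by the level instead
c = v.groupby([v.index.get_level_values(0), v]).count().groupby(level=1).count()
# c = v.groupby([v.index.get_level_values(0), v]).nunique().groupby(level=1).count()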

Then,

df2['count'] = df2.ID.map(c)
df2

         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6


Option 3
repeat-based Reshape and counting

v = pd.DataFrame({
        'i' : df1.values.reshape(-1, ), 
        'j' : df1.index.repeat(2)
    })
c = v.loc[~v.duplicated(), 'i'].value_counts()

df2['count'] = df2.ID.map(c)
df2

         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6
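The reshape and repeat steps in Option 3 are easier to follow on a tiny frame. The sketch below uses hypothetical `demo` data and shows how flattening the values row by row lines up with the repeated row index, so that a duplicated (name, row) pair marks a name appearing twice in the same row.

```python
import pandas as pd

# Hypothetical toy data: row 0 has 'joe' in both columns.
demo = pd.DataFrame({'ID_a': ['joe', 'ben'], 'ID_b': ['joe', 'jack']})

values = demo.values.reshape(-1)   # row-major flatten: joe, joe, ben, jack
rows = demo.index.repeat(2)        # matching row labels: 0, 0, 1, 1

pairs = pd.DataFrame({'i': values, 'j': rows})

# Drop repeated (name, row) pairs so each row counts once per name.
unique_pairs = pairs[~pairs.duplicated()]
counts = unique_pairs['i'].value_counts()

print(unique_pairs)
print(counts)
```

The `repeat(2)` is tied to df1 having exactly two columns; with a different number of columns the repeat factor would presumably need to change to match.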


Option 4
concat + mask

v = pd.concat(
    [df1.ID_a, df1.ID_b.mask(df1.ID_a == df1.ID_b)], axis=0
).value_counts()

df2['count'] = df2.ID.map(v)
df2

         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6
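As a sanity check, the following sketch compares Option 1 and Option 4 against the row-wise logic from the question on a small hypothetical frame (the data and the names `expected`, `opt1`, `opt4` are illustrative, not from the original post). It also highlights one difference worth knowing: a name that never appears in df1 maps to 0 with the Counter, but to NaN with the value_counts Series, so the latter needs a fillna if zeros are wanted.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Hypothetical toy data; 'jill' never appears in df1.
df1 = pd.DataFrame({
    'ID_a': ['joe', 'ben', 'jane', 'joe'],
    'ID_b': ['joe', 'jack', 'joe', 'ben'],
})
df2 = pd.DataFrame({'ID': ['jack', 'jill', 'jane', 'joe', 'ben']})

# Reference: same condition as the loop in the question.
expected = [int(((df1.ID_a == n) | (df1.ID_b == n)).sum()) for n in df2.ID]

# Option 1: de-duplicate within each row via set(), then count.
c = Counter(chain.from_iterable(set(x) for x in df1.values.tolist()))
opt1 = df2.ID.map(c).tolist()

# Option 4: blank out ID_b where it repeats ID_a, then count the pooled values.
v = pd.concat([df1.ID_a, df1.ID_b.mask(df1.ID_a == df1.ID_b)]).value_counts()
opt4 = df2.ID.map(v).fillna(0).astype(int).tolist()

print(expected)  # [1, 0, 1, 3, 2]
print(opt1)      # [1, 0, 1, 3, 2]
print(opt4)      # [1, 0, 1, 3, 2]
```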
