如何计算两列中任一列的出现次数 [英] How to count the number of occurrences in either of two columns

查看:93
本文介绍了如何计算两列中任一列的出现次数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个简单的问题.我有一个具有两列的数据框df.对于这两个列中任一列中出现的每个字符串,我都希望计算其中任一列中都有符号的行数.

I have a simple looking problem. I have a dataframe df with two columns. For each of the strings that occurs in either of these columns I would like to count the number of rows which has the symbol in either column.

例如

g k
a h
c i
j e
d i
i h
b b
d d
i a
d h

以下代码有效,但效率很低.

The following code works but is very inefficient.

for elem in set(df.values.flat):
    print elem, len(df.loc[(df[0] == elem) | (df[1] == elem)])


a 2
c 1
b 1
e 1
d 3
g 1
i 4
h 3
k 1
j 1

但是,这效率很低,而且我的数据帧很大.效率低下是因为分别为df中的每个不同符号调用了df.loc[(df[0] == elem) | (df[1] == elem)].

This is however very inefficient and my dataframe is large. The inefficiency comes from calling df.loc[(df[0] == elem) | (df[1] == elem)] separately for every distinct symbol in df.

有快速的方法吗?

推荐答案

您可以使用loc过滤掉'col2'中的行级别匹配项,将过滤后的'col2'值附加到'col1',然后调用value_counts:

You can use loc to filter out row level matches from 'col2', append the filtered 'col2' values to 'col1', and then call value_counts:

counts = df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()

结果输出:

i    4
d    3
h    3
a    2
j    1
k    1
c    1
g    1
b    1
e    1

注意:如果希望输出按字母顺序显示,可以在计数代码的末尾添加.sort_index().

Note: You can add .sort_index() to the end of the counting code if you want the output to appear in alphabetical order.

时间

使用以下设置来生成更大的样本数据集:

Using the following setup to produce a larger sample dataset:

from string import ascii_lowercase

n = 10**5
data = np.random.choice(list(ascii_lowercase), size=(n,2))
df = pd.DataFrame(data, columns=['col1', 'col2'])

def edchum(df):
    vals = np.unique(df.values)
    count = np.maximum(df['col1'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0), df['col2'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0)).sum()
    return count

我得到以下计时:

%timeit df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()
10 loops, best of 3: 19.7 ms per loop

%timeit edchum(df)
1 loop, best of 3: 3.81 s per loop

这篇关于如何计算两列中任一列的出现次数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆