How to count rows efficiently with one pass over the dataframe
Question
I have a dataframe made of strings like this:
ID_0 ID_1
g k
a h
c i
j e
d i
i h
b b
d d
i a
d h
For each pair of strings I can count how many rows contain either string as follows.
import pandas as pd
import itertools
df = pd.read_csv("test.csv", header=None, names=['ID_0', 'ID_1'], usecols=[0, 1])
alphabet_1 = set(df['ID_0'])
alphabet_2 = set(df['ID_1'])
# This just makes a set of all the strings in the dataframe.
alphabet = alphabet_1 | alphabet_2
# This iterates over all pairs and counts how many rows have either string in either column.
for (x, y) in itertools.combinations(alphabet, 2):
    print(x, y, len(df.loc[df['ID_0'].isin([x, y]) | df['ID_1'].isin([x, y])]))
This gives:
a c 3
a b 3
a e 3
a d 5
a g 3
a i 5
a h 4
a k 3
a j 3
c b 2
c e 2
c d 4
[...]
The problem is that my dataframe is very large, the alphabet is of size 200, and this method makes an independent pass over the whole dataframe for each pair of letters.
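To make the blowup concrete: with an alphabet of 200 symbols, the pairwise loop above makes one full scan of the dataframe per unordered pair. A quick back-of-the-envelope check (the symbol count of 200 is taken from the question; the rest is just arithmetic):

```python
import math

# Each unordered pair of symbols triggers one full pass over the dataframe.
n_symbols = 200
n_pairs = math.comb(n_symbols, 2)
print(n_pairs)  # 19900 full passes over the data
```

So even before the dataframe grows, the pair loop alone multiplies the work by nearly 20,000.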
Is it possible to get the same output by doing a single pass over the dataframe somehow?
Timings
I created some data with:
import numpy as np
import pandas as pd
from string import ascii_lowercase
n = 10**4
data = np.random.choice(list(ascii_lowercase), size=(n,2))
df = pd.DataFrame(data, columns=['ID_0', 'ID_1'])
# Testing Parfait's answer
def f(row):
    ser = len(df[(df['ID_0'] == row['ID_0']) | (df['ID_1'] == row['ID_0']) |
                 (df['ID_0'] == row['ID_1']) | (df['ID_1'] == row['ID_1'])])
    return ser

%timeit df.apply(f, axis=1)
1 loops, best of 3: 37.8 s per loop
I would like to be able to do this for n = 10**8. Can this be sped up?
Answer
You can get past the row-level subiteration by using some clever combinatorics/set theory to do the counting:
# Count of individual characters and of ordered pairs.
# (Series.append was removed in pandas 2.0; pd.concat is the modern equivalent.)
char_count = pd.concat([df['ID_0'],
                        df.loc[df['ID_0'] != df['ID_1'], 'ID_1']]).value_counts().to_dict()
pair_count = df.groupby(['ID_0', 'ID_1']).size().to_dict()

# Get the counts via inclusion-exclusion.
df['count'] = [char_count[x] if x == y
               else char_count[x] + char_count[y] - (pair_count[x, y] + pair_count.get((y, x), 0))
               for x, y in df[['ID_0', 'ID_1']].values]
Resulting output:
ID_0 ID_1 count
0 g k 1
1 a h 4
2 c i 4
3 j e 1
4 d i 6
5 i h 6
6 b b 1
7 d d 3
8 i a 5
9 d h 5
I've compared the output of my method to the row-level iteration method on a dataset with 5000 rows, and all of the counts match.
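A cross-check of that kind is easy to reproduce. The sketch below (not from the original post; the 6-letter alphabet and 50-row size are arbitrary choices to force repeated pairs) computes the counts both ways on a small random frame and asserts they agree:

```python
import numpy as np
import pandas as pd
from string import ascii_lowercase

rng = np.random.default_rng(0)
small = pd.DataFrame(rng.choice(list(ascii_lowercase[:6]), size=(50, 2)),
                     columns=['ID_0', 'ID_1'])

# Single-pass counts via inclusion-exclusion (same idea as above).
char_count = pd.concat([small['ID_0'],
                        small.loc[small['ID_0'] != small['ID_1'], 'ID_1']]).value_counts().to_dict()
pair_count = small.groupby(['ID_0', 'ID_1']).size().to_dict()
fast = [char_count[x] if x == y
        else char_count[x] + char_count[y] - (pair_count[x, y] + pair_count.get((y, x), 0))
        for x, y in small[['ID_0', 'ID_1']].values]

# Brute force: scan the whole frame once per row, as in the question.
slow = [(small['ID_0'].isin([x, y]) | small['ID_1'].isin([x, y])).sum()
        for x, y in small[['ID_0', 'ID_1']].values]

assert all(a == b for a, b in zip(fast, slow))
```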
Why does this work? It essentially just relies on the formula for counting the union of two sets:

|A ∪ B| = |A| + |B| − |A ∩ B|

The cardinality of a given element is just the char_count. When the elements are distinct, the cardinality of the intersection is just the count of pairs of the elements in any order. Note that when the two elements are identical, the formula reduces to just the char_count.
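The identity can be checked directly on the example data. For the row ('a', 'h'), whose count in the output above is 4, the terms work out as follows:

```python
import pandas as pd

# The ten example rows from the question, column by column.
df = pd.DataFrame({'ID_0': list('gacjdibdid'),
                   'ID_1': list('khieihbdah')})

n_a = ((df['ID_0'] == 'a') | (df['ID_1'] == 'a')).sum()      # rows containing 'a'
n_h = ((df['ID_0'] == 'h') | (df['ID_1'] == 'h')).sum()      # rows containing 'h'
n_both = (((df['ID_0'] == 'a') & (df['ID_1'] == 'h')) |
          ((df['ID_0'] == 'h') & (df['ID_1'] == 'a'))).sum() # rows containing both

# |A ∪ B| = |A| + |B| - |A ∩ B|
print(n_a + n_h - n_both)  # 2 + 3 - 1 = 4, matching the 'count' column above
```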
Timings
Using the timing setup in the question, and the following function for my answer:
def root(df):
    char_count = pd.concat([df['ID_0'],
                            df.loc[df['ID_0'] != df['ID_1'], 'ID_1']]).value_counts().to_dict()
    pair_count = df.groupby(['ID_0', 'ID_1']).size().to_dict()
    df['count'] = [char_count[x] if x == y
                   else char_count[x] + char_count[y] - (pair_count[x, y] + pair_count.get((y, x), 0))
                   for x, y in df[['ID_0', 'ID_1']].values]
    return df
I get the following timings for n=10**4:
%timeit root(df.copy())
10 loops, best of 3: 25 ms per loop
%timeit df.apply(f, axis=1)
1 loop, best of 3: 49.4 s per loop
I get the following timing for n=10**6:
%timeit root(df.copy())
10 loops, best of 3: 2.22 s per loop
It appears that my solution scales approximately linearly.