How to count rows efficiently with one pass over the dataframe

Question

I have a dataframe made of strings like this:

ID_0 ID_1
 g    k
 a    h
 c    i
 j    e
 d    i
 i    h
 b    b
 d    d
 i    a
 d    h

For each pair of strings I can count how many rows have either string in them as follows.

import pandas as pd
import itertools

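# `prefix=` was removed from read_csv in pandas 2.0, so name the columns explicitly.
df = pd.read_csv("test.csv", header=None, usecols=[0, 1])
df.columns = ["ID_0", "ID_1"]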

alphabet_1 = set(df['ID_0'])
alphabet_2 = set(df['ID_1'])
# This just makes a set of all the strings in the dataframe.
alphabet = alphabet_1 | alphabet_2
# Iterate over all pairs and count the rows that have either string in either column.
for x, y in itertools.combinations(alphabet, 2):
    print(x, y, len(df.loc[df['ID_0'].isin([x, y]) | df['ID_1'].isin([x, y])]))

This gives:

a c 3
a b 3
a e 3
a d 5
a g 3
a i 5
a h 4
a k 3
a j 3
c b 2
c e 2
c d 4
[...]

The problem is that my dataframe is very large and the alphabet is of size 200 and this method does an independent traversal over the whole dataframe for each pair of letters.

Is it possible to get the same output by doing a single pass over the dataframe somehow?

Timings

I created some data with:

import numpy as np
import pandas as pd
from string import ascii_lowercase
n = 10**4
data = np.random.choice(list(ascii_lowercase), size=(n,2))
df = pd.DataFrame(data, columns=['ID_0', 'ID_1'])

#Testing Parfait's answer
def f(row):
    ser = len(df[(df['ID_0'] == row['ID_0']) | (df['ID_1'] == row['ID_0'])|
                 (df['ID_0'] == row['ID_1']) | (df['ID_1'] == row['ID_1'])])
    return(ser)

%timeit df.apply(f, axis=1)
1 loops, best of 3: 37.8 s per loop

I would like to be able to do this for n = 10**8. Can this be sped up?

Recommended answer

You can get past the row level subiteration by using some clever combinatorics/set theory to do the counting:

# Count of individual characters and pairs.
# pd.concat replaces Series.append, which was removed in pandas 2.0.
# A row with ID_0 == ID_1 contributes its character only once.
char_count = pd.concat([df['ID_0'], df.loc[df['ID_0'] != df['ID_1'], 'ID_1']]).value_counts().to_dict()
pair_count = df.groupby(['ID_0', 'ID_1']).size().to_dict()

# Get the counts via inclusion-exclusion.
df['count'] = [
    char_count[x] if x == y
    else char_count[x] + char_count[y] - (pair_count[x, y] + pair_count.get((y, x), 0))
    for x, y in df[['ID_0', 'ID_1']].values
]

The resulting output:

  ID_0 ID_1  count
0    g    k      1
1    a    h      4
2    c    i      4
3    j    e      1
4    d    i      6
5    i    h      6
6    b    b      1
7    d    d      3
8    i    a      5
9    d    h      5

I've compared the output of my method to the row level iteration method on a dataset with 5000 rows and all of the counts match.
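
A minimal sketch of how such a comparison could be reproduced. The function names `count_fast` and `count_brute_force`, the sample size of 500 rows, and the random seed are assumptions made for this illustration, not part of the original answer. `pd.concat` is used here in place of `Series.append`, which newer pandas versions no longer provide.

```python
import numpy as np
import pandas as pd
from string import ascii_lowercase


def count_fast(df):
    """Inclusion-exclusion counts, following the approach in the answer."""
    distinct = df["ID_0"] != df["ID_1"]
    # Rows with ID_0 == ID_1 contribute their character only once.
    char_count = pd.concat([df["ID_0"], df.loc[distinct, "ID_1"]]).value_counts().to_dict()
    pair_count = df.groupby(["ID_0", "ID_1"]).size().to_dict()
    return [
        char_count[x] if x == y
        else char_count[x] + char_count[y] - (pair_count[x, y] + pair_count.get((y, x), 0))
        for x, y in df[["ID_0", "ID_1"]].to_numpy()
    ]


def count_brute_force(df):
    """Row-level reference: rescan the whole frame once per row."""
    counts = []
    for x, y in df[["ID_0", "ID_1"]].to_numpy():
        mask = df["ID_0"].isin([x, y]) | df["ID_1"].isin([x, y])
        counts.append(int(mask.sum()))
    return counts


rng = np.random.default_rng(0)  # arbitrary fixed seed (assumption)
data = rng.choice(list(ascii_lowercase), size=(500, 2))
sample = pd.DataFrame(data, columns=["ID_0", "ID_1"])

# Both methods should agree on every row.
assert count_fast(sample) == count_brute_force(sample)
```

The brute-force reference is quadratic in the number of rows, so it is only practical for small samples like this one.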

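Why does this work? It essentially just relies on the formula for counting the union of two sets, where A and B are the sets of rows containing each of the two elements:

    |A ∪ B| = |A| + |B| - |A ∩ B|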

The cardinality of the set of rows containing a given element is just its char_count. When the two elements are distinct, the cardinality of the intersection is the number of rows containing that pair in either order, which is pair_count[x,y] + pair_count.get((y,x), 0). Note that when the two elements are identical, the two sets coincide and the formula reduces to just the char_count.
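
The same formula can also produce the per-pair output asked for in the question, after a single pass over the rows. The following is a sketch in plain Python rather than the answer's own code; the function name `pairwise_row_counts` and the use of `collections.Counter` are choices made for this illustration.

```python
from collections import Counter
from itertools import combinations


def pairwise_row_counts(rows):
    """Return {(x, y): number of rows containing x or y} for all unordered pairs."""
    rows_with = Counter()       # rows that contain x in either column
    rows_with_both = Counter()  # rows that contain both x and y (x != y)
    for a, b in rows:           # the single pass over the data
        rows_with[a] += 1
        if a != b:
            rows_with[b] += 1
            rows_with_both[frozenset((a, b))] += 1
    alphabet = sorted(rows_with)
    # Inclusion-exclusion per pair; a Counter returns 0 for pairs never seen together.
    return {
        (x, y): rows_with[x] + rows_with[y] - rows_with_both[frozenset((x, y))]
        for x, y in combinations(alphabet, 2)
    }


# The ten example rows from the question.
rows = [("g", "k"), ("a", "h"), ("c", "i"), ("j", "e"), ("d", "i"),
        ("i", "h"), ("b", "b"), ("d", "d"), ("i", "a"), ("d", "h")]
counts = pairwise_row_counts(rows)
print(counts[("a", "d")], counts[("a", "h")], counts[("a", "i")])  # 5 4 5
```

For a dataframe, the rows could be supplied as `zip(df['ID_0'], df['ID_1'])`. With an alphabet of size 200 the final step touches about 20,000 pairs, which does not depend on the number of rows.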

Timings

Using the timing setup in the question, and the following function for my answer:

def root(df):
    # pd.concat replaces Series.append, which was removed in pandas 2.0.
    char_count = pd.concat([df['ID_0'], df.loc[df['ID_0'] != df['ID_1'], 'ID_1']]).value_counts().to_dict()
    pair_count = df.groupby(['ID_0', 'ID_1']).size().to_dict()
    df['count'] = [
        char_count[x] if x == y
        else char_count[x] + char_count[y] - (pair_count[x, y] + pair_count.get((y, x), 0))
        for x, y in df[['ID_0', 'ID_1']].values
    ]
    return df

I get the following timings for n=10**4:

%timeit root(df.copy())
10 loops, best of 3: 25 ms per loop

%timeit df.apply(f, axis=1)
1 loop, best of 3: 49.4 s per loop

I get the following timing for n=10**6:

%timeit root(df.copy())
10 loops best of 3: 2.22 s per loop

It appears that my solution scales approximately linearly.