使用过滤器计算第二个数据帧中某个值的出现次数 [英] Counting occurrences of a value in second dataframe with filter

查看:57
本文介绍了使用过滤器计算第二个数据帧中某个值的出现次数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试计算第二个数据帧中两个时间段之间 id 值的出现次数,并将该值作为列附加到第一个数据帧中.

I am trying to count the occurrence of an id value between two time periods in a second dataframe, and append the value as a column in the first dataframe.

我目前有以下代码,它很好地解决了问题,但在我的应用程序中造成了很大的瓶颈.

I currently have the following code, which solves the problem fine but causes a large bottleneck in my application.

import pandas as pd

dfA = pd.DataFrame({
    'id' : ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D'],
    'start': [1, 4, 2, 1, 4, 5, 3, 6],
    'end': [3, 6, 7, 5, 7, 7, 8, 10]}) 

dfB = pd.DataFrame({
    'id': ['A', 'B', 'A', 'B', 'C', 'B', 'B', 'B', 'C', 'C', 'D', 'A', 'D'],
    'time': [1, 5, 2, 6, 8, 3, 5, 7, 8, 9, 5, 3, 6]})


def count(start, end, df_id, dfB):
    dfA_valid = dfB[(dfB.time >= start) & 
                      (dfB.time <= end) &
                      (dfB.id == df_id)]
    dfA_count = len(dfA_valid.index)
    return dfA_count
    
    
def get_counts(dfA, dfB):
    dfA['count'] = dfA.apply(lambda x: count(
                            x['start'], x['end'], 
                            x['id'], dfB), axis = 1)
    return dfA

dfA_solved = get_counts(dfA.copy(), dfB)

dfA 是大约 100 万行,dfB 是大约 200 万行,这部分需要很长时间才能运行.有没有办法加快速度?

dfA is c.1 million rows and dfB is c.2 million rows and this section is taking a long time to run. Is there a way to speed this up?

推荐答案

酷.我会尝试两件事,重写一点代码和并行化.

Cool. I will try two things, to rewrite a bit the code and parallelization.

重现初始数据

import pandas as pd
import time

dfA = pd.DataFrame({
    'id': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D'],
    'start': [1, 4, 2, 1, 4, 5, 3, 6],
    'end': [3, 6, 7, 5, 7, 7, 8, 10]})

dfB = pd.DataFrame({
    'id': ['A', 'B', 'A', 'B', 'C', 'B', 'B', 'B', 'C', 'C', 'D', 'A', 'D'],
    'time': [1, 5, 2, 6, 8, 3, 5, 7, 8, 9, 5, 3, 6]})

但我们希望有更多的东西来检查可扩展性.随意更改 N 的值:

But we want to have more to check scalability. Feel free to change the values of N:

# Create bigger samples
N =15
for i in range(0,N):
    #print(i)
    dfB = pd.concat([dfB,dfB])

N =6
for i in range(0,N):
    #print(i)
    dfA = pd.concat([dfA,dfA])

print('shape dfB ',dfB.shape)
print('shape dfA ',dfA.shape)

就我而言,输出是:

shape dfB  (425984, 2)
shape dfA  (512, 3)

现在我有足够大的数据可以玩了.

Now I have a good data size to play.

def count(start, end, df_id, dfB):
    dfA_valid = dfB[(dfB.time >= start) &
                    (dfB.time <= end) &
                    (dfB.id == df_id)]
    dfA_count = len(dfA_valid.index)
    return dfA_count




def get_counts(dfA, dfB):
    dfA['count'] = dfA.apply(lambda x: count(
        x['start'], x['end'],
        x['id'], dfB), axis=1)
    return dfA


start = time.time()
dfA_solved = get_counts(dfA.copy(), dfB)
end = time.time()
print(end - start)

重现之前的代码:

在我的机器中 14.757256031036377 秒

14.757256031036377 seconds in my machine

我认为计数的方法可以更简单:

I think the way to count can be simpler:

# This is to use index for search
dfB.set_index('id',inplace=True)
#dfB.reset_index(inplace=True)

def count(start, end, df_id, dfB):
    # This is to reduce the search
    dfA_valid = dfB.loc[df_id,:].copy()
    return  ((dfA_valid.time >= start) &
    (dfA_valid.time <= end)).sum()



def get_counts(dfA):
    dfA['count'] = dfA.apply(lambda x: count(
        x['start'], x['end'],
        x['id'], dfB), axis=1)
    return dfA



start = time.time()
dfA_solved = get_counts(dfA.copy())
end = time.time()
print("Changes on count function ",end - start)

向下做 7 秒.

但如果我们真的想走得更远,我们需要并行化.

But if we really want to way further we need to parallelize.

"""Now with parallelization"""
import numpy as np
from multiprocessing import  Pool,cpu_count

num_cores = cpu_count()
num_partitions = num_cores-1


def parallelize(data, func):
    data_split = np.array_split(data, num_partitions)
    pool = Pool(num_cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

start = time.time()
dfA_solved = parallelize(dfA, get_counts)
end = time.time()
print("Changes on count function and parallelization ",end - start)

缩短到 4 秒.

shape dfB  (425984, 2)
shape dfA  (512, 3)
14.757256031036377
Changes on count function  7.768801212310791
Changes on count function and parallelization  4.349687814712524

这篇关于使用过滤器计算第二个数据帧中某个值的出现次数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆