Counting occurrences of a value in a second dataframe with a filter
Problem description
I am trying to count the occurrences of an id value between two time periods in a second dataframe, and append the count as a column in the first dataframe.
I currently have the following code, which solves the problem correctly but creates a large bottleneck in my application.
import pandas as pd

dfA = pd.DataFrame({
    'id': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D'],
    'start': [1, 4, 2, 1, 4, 5, 3, 6],
    'end': [3, 6, 7, 5, 7, 7, 8, 10]})

dfB = pd.DataFrame({
    'id': ['A', 'B', 'A', 'B', 'C', 'B', 'B', 'B', 'C', 'C', 'D', 'A', 'D'],
    'time': [1, 5, 2, 6, 8, 3, 5, 7, 8, 9, 5, 3, 6]})

def count(start, end, df_id, dfB):
    # Rows of dfB matching this id whose time falls in [start, end]
    dfA_valid = dfB[(dfB.time >= start) &
                    (dfB.time <= end) &
                    (dfB.id == df_id)]
    return len(dfA_valid.index)

def get_counts(dfA, dfB):
    dfA['count'] = dfA.apply(lambda x: count(
        x['start'], x['end'],
        x['id'], dfB), axis=1)
    return dfA

dfA_solved = get_counts(dfA.copy(), dfB)
dfA is c. 1 million rows and dfB is c. 2 million rows, and this section takes a long time to run. Is there a way to speed it up?
Recommended answer
Cool. I will try two things: rewriting the code a bit, and parallelization.
Reproduce the initial data:
import pandas as pd
import time

dfA = pd.DataFrame({
    'id': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D'],
    'start': [1, 4, 2, 1, 4, 5, 3, 6],
    'end': [3, 6, 7, 5, 7, 7, 8, 10]})

dfB = pd.DataFrame({
    'id': ['A', 'B', 'A', 'B', 'C', 'B', 'B', 'B', 'C', 'C', 'D', 'A', 'D'],
    'time': [1, 5, 2, 6, 8, 3, 5, 7, 8, 9, 5, 3, 6]})
But we want more data to check scalability. Feel free to change the values of N:
# Create bigger samples by repeatedly doubling each frame
N = 15
for i in range(0, N):
    dfB = pd.concat([dfB, dfB])

N = 6
for i in range(0, N):
    dfA = pd.concat([dfA, dfA])

print('shape dfB ', dfB.shape)
print('shape dfA ', dfA.shape)
In my case, the output is:
shape dfB (425984, 2)
shape dfA (512, 3)
Now I have a good data size to play with.
def count(start, end, df_id, dfB):
    dfA_valid = dfB[(dfB.time >= start) &
                    (dfB.time <= end) &
                    (dfB.id == df_id)]
    return len(dfA_valid.index)

def get_counts(dfA, dfB):
    dfA['count'] = dfA.apply(lambda x: count(
        x['start'], x['end'],
        x['id'], dfB), axis=1)
    return dfA

start = time.time()
dfA_solved = get_counts(dfA.copy(), dfB)
end = time.time()
print(end - start)
Rerunning the original code as a baseline: 14.757256031036377 seconds on my machine.
I think the way of counting can be simpler:
# Set the id as the index so lookups by id use the index
dfB.set_index('id', inplace=True)

def count(start, end, df_id, dfB):
    # Restrict the search to this id's rows before comparing times
    dfA_valid = dfB.loc[df_id, :].copy()
    return ((dfA_valid.time >= start) &
            (dfA_valid.time <= end)).sum()

def get_counts(dfA):
    dfA['count'] = dfA.apply(lambda x: count(
        x['start'], x['end'],
        x['id'], dfB), axis=1)
    return dfA

start = time.time()
dfA_solved = get_counts(dfA.copy())
end = time.time()
print("Changes on count function ", end - start)
Down to 7 seconds.
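As a side note (my own sketch, not part of the original answer), the per-row Python loop can be avoided entirely with a merge: pair every interval with every event sharing its id, filter, then count per row. This is fast but the intermediate merge can be very large when ids repeat heavily, so it may not fit in memory at the 1M x 2M scale in the question. Using the small sample data:

```python
import pandas as pd

dfA = pd.DataFrame({
    'id': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D'],
    'start': [1, 4, 2, 1, 4, 5, 3, 6],
    'end': [3, 6, 7, 5, 7, 7, 8, 10]})
dfB = pd.DataFrame({
    'id': ['A', 'B', 'A', 'B', 'C', 'B', 'B', 'B', 'C', 'C', 'D', 'A', 'D'],
    'time': [1, 5, 2, 6, 8, 3, 5, 7, 8, 9, 5, 3, 6]})

# Pair every interval in dfA with every event in dfB that shares its id;
# reset_index keeps the original dfA row label in an 'index' column
merged = dfA.reset_index().merge(dfB, on='id')
# Keep only events inside the interval, then count per original dfA row
in_range = merged[(merged.time >= merged.start) & (merged.time <= merged.end)]
dfA['count'] = in_range.groupby('index').size().reindex(dfA.index, fill_value=0)
print(dfA['count'].tolist())  # [3, 0, 5, 3, 4, 0, 2, 1]
```

The `reindex(..., fill_value=0)` step matters: rows of dfA whose interval contains no events would otherwise be dropped from the grouped counts.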
But if we really want to go further, we need to parallelize.
"""Now with parallelization"""
import numpy as np
from multiprocessing import Pool,cpu_count
num_cores = cpu_count()
num_partitions = num_cores-1
def parallelize(data, func):
data_split = np.array_split(data, num_partitions)
pool = Pool(num_cores)
data = pd.concat(pool.map(func, data_split))
pool.close()
pool.join()
return data
start = time.time()
dfA_solved = parallelize(dfA, get_counts)
end = time.time()
print("Changes on count function and parallelization ",end - start)
Down to 4 seconds. The full output:
shape dfB (425984, 2)
shape dfA (512, 3)
14.757256031036377
Changes on count function 7.768801212310791
Changes on count function and parallelization 4.349687814712524
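One more direction worth sketching (again my own addition, not from the original answer): sort each id's event times once, then answer each interval with two binary searches via `numpy.searchsorted`. This keeps memory flat, unlike a full merge, and turns each per-row lookup into O(log n):

```python
import numpy as np
import pandas as pd

dfA = pd.DataFrame({
    'id': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D'],
    'start': [1, 4, 2, 1, 4, 5, 3, 6],
    'end': [3, 6, 7, 5, 7, 7, 8, 10]})
dfB = pd.DataFrame({
    'id': ['A', 'B', 'A', 'B', 'C', 'B', 'B', 'B', 'C', 'C', 'D', 'A', 'D'],
    'time': [1, 5, 2, 6, 8, 3, 5, 7, 8, 9, 5, 3, 6]})

# Sort the event times once per id
times_by_id = {k: np.sort(g['time'].to_numpy()) for k, g in dfB.groupby('id')}

def count_in_interval(row):
    t = times_by_id.get(row['id'])
    if t is None:
        return 0
    # Number of events with start <= time <= end, via two binary searches
    return int(np.searchsorted(t, row['end'], side='right')
               - np.searchsorted(t, row['start'], side='left'))

dfA['count'] = dfA.apply(count_in_interval, axis=1)
print(dfA['count'].tolist())  # [3, 0, 5, 3, 4, 0, 2, 1]
```

This still pays the `apply` overhead per row of dfA, but it composes with the parallelization above, and the sort-once precomputation is shared across all rows.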