Fastest way to calculate in Pandas?
Question
Given the following two dataframes:
df1 =
  Name  Start  End
0    A     10   20
1    B     20   30
2    C     30   40

df2 =
    0   1
0   5  10
1  15  20
2  25  30
df2 has no column names, but you can assume that column 0 is an offset of df1.Start and column 1 is an offset of df1.End. I would like to transpose df2 onto df1, so that each row of df2 yields a pair of Start and End difference columns. The final df1 dataframe should look like this:
  Name  Start  End  Start_Diff_0  End_Diff_0  Start_Diff_1  End_Diff_1  Start_Diff_2  End_Diff_2
0    A     10   20             5          10            -5           0           -15         -10
1    B     20   30            15          20             5          10            -5           0
2    C     30   40            25          30            15          20             5          10
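Reading the table above, each pair of new columns appears to be df1's Start and End minus row k of df2. A minimal sketch of that reading for k = 1, using only the sample data (the variable names `start_diff_k` and `end_diff_k` are illustrative, not from the original post):

```python
import pandas as pd

df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]],
                   columns=['Name', 'Start', 'End'])
df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]])

k = 1  # which row of df2 to use as the offset pair

# Subtract the scalar offsets in row k of df2 from the whole df1 columns
start_diff_k = df1['Start'] - df2.iloc[k, 0]
end_diff_k = df1['End'] - df2.iloc[k, 1]

print(start_diff_k.tolist())  # [-5, 5, 15], the Start_Diff_1 column
print(end_diff_k.tolist())    # [0, 10, 20], the End_Diff_1 column
```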
I have a solution that works, but I'm not satisfied with it because it takes too long to run when processing a dataframe that has millions of rows. Below is a sample test case that simulates processing 30,000 rows. As you can imagine, running the original solution (method_1) on a 1 GB dataframe is going to be a problem. Is there a faster way to do this using Pandas, NumPy, or maybe another package?
UPDATE: I've added the provided solutions to the benchmarks.
# Import required modules
import numpy as np
import pandas as pd
import timeit

# Original
def method_1():
    df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]] * 10000, columns=['Name', 'Start', 'End'])
    df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
    # Store data for new columns in a dictionary
    new_columns = {}
    for index1, row1 in df1.iterrows():
        for index2, row2 in df2.iterrows():
            key_start = 'Start_Diff_' + str(index2)
            key_end = 'End_Diff_' + str(index2)
            # Access df1 values by label; positional row1[1] is deprecated in recent pandas
            if key_start in new_columns:
                new_columns[key_start].append(row1['Start'] - row2[0])
            else:
                new_columns[key_start] = [row1['Start'] - row2[0]]
            if key_end in new_columns:
                new_columns[key_end].append(row1['End'] - row2[1])
            else:
                new_columns[key_end] = [row1['End'] - row2[1]]
    # Add dictionary data as new columns
    for key, value in new_columns.items():
        df1[key] = value

# jezrael - https://stackoverflow.com/a/60843750/452587
def method_2():
    df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]] * 10000, columns=['Name', 'Start', 'End'])
    df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
    # Convert selected columns to 2d numpy arrays
    a = df1[['Start', 'End']].to_numpy()
    b = df2[[0, 1]].to_numpy()
    # Output is 3d array; convert it to 2d array
    c = (a - b[:, None]).swapaxes(0, 1).reshape(a.shape[0], -1)
    # Generate column names, then add to the original with DataFrame.join
    cols = [item for x in range(b.shape[0]) for item in (f'Start_Diff_{x}', f'End_Diff_{x}')]
    df1 = df1.join(pd.DataFrame(c, columns=cols, index=df1.index))

# sammywemmy - https://stackoverflow.com/a/60844078/452587
def method_3():
    df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]] * 10000, columns=['Name', 'Start', 'End'])
    df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
    # Create numpy arrays of df1 and df2
    df1_start = df1.loc[:, 'Start'].to_numpy()
    df1_end = df1.loc[:, 'End'].to_numpy()
    df2_start = df2[0].to_numpy()
    df2_end = df2[1].to_numpy()
    # Use np.tile to create shapes that allow elementwise subtraction
    tiled_start = np.tile(df1_start, (len(df2), 1)).T
    tiled_end = np.tile(df1_end, (len(df2), 1)).T
    # Subtract df2 from df1
    start = np.subtract(tiled_start, df2_start)
    end = np.subtract(tiled_end, df2_end)
    # Create columns for start and end
    start_columns = [f'Start_Diff_{num}' for num in range(len(df2))]
    end_columns = [f'End_Diff_{num}' for num in range(len(df2))]
    # Create dataframes of start and end
    start_df = pd.DataFrame(start, columns=start_columns)
    end_df = pd.DataFrame(end, columns=end_columns)
    # Lump start and end into one dataframe
    lump = pd.concat([start_df, end_df], axis=1)
    # Sort the columns by the digit at the end (only the last character is compared,
    # so this ordering assumes df2 has at most 10 rows)
    filtered = lump.columns[lump.columns.str.contains(r'\d')]
    cols = sorted(filtered, key=lambda x: x[-1])
    lump = lump.reindex(cols, axis='columns')
    # Hook lump back to df1
    df1 = pd.concat([df1, lump], axis=1)

print('Method 1:', timeit.timeit(method_1, number=3))
print('Method 2:', timeit.timeit(method_2, number=3))
print('Method 3:', timeit.timeit(method_3, number=3))
Output:
Method 1: 50.506279182
Method 2: 0.08886280600000163
Method 3: 0.10297686199999845
Recommended answer
I suggest using NumPy here. As a first step, convert the selected columns to 2D NumPy arrays:
a = df1[['Start','End']].to_numpy()
b = df2[[0,1]].to_numpy()
The broadcast subtraction produces a 3D array; convert it to a 2D array:
c = (a - b[:, None]).swapaxes(0,1).reshape(a.shape[0],-1)
print (c)
[[  5  10  -5   0 -15 -10]
 [ 15  20   5  10  -5   0]
 [ 25  30  15  20   5  10]]
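To see why the one-liner works, it may help to look at the array shape after each step. The sketch below splits the expression into named intermediate steps (`step1` to `step3` are illustrative names). It also appends a hypothetical fourth row `[40, 50]` to the df1 values so that n = 4 and m = 3 are distinguishable in the shapes; that row is an assumption, not part of the original data.

```python
import numpy as np

# Values of df1[['Start', 'End']]; the last row is a hypothetical extra row
a = np.array([[10, 20], [20, 30], [30, 40], [40, 50]])  # shape (n, 2), n = 4
# Values of df2[[0, 1]]
b = np.array([[5, 10], [15, 20], [25, 30]])             # shape (m, 2), m = 3

step1 = b[:, None]                   # (m, 1, 2): one leading slot per offset row
step2 = a - step1                    # (m, n, 2): broadcasting pairs every a row with every b row
step3 = step2.swapaxes(0, 1)         # (n, m, 2): df1 rows first, offsets second
c = step3.reshape(a.shape[0], -1)    # (n, 2*m): flatten each row's (offset, Start/End) pairs

print(step1.shape, step2.shape, step3.shape, c.shape)

# An equivalent formulation that broadcasts straight to (n, m, 2), skipping swapaxes
alt = (a[:, None, :] - b).reshape(len(a), -1)
print(np.array_equal(c, alt))
```

Because the last axis holds the (Start, End) pair and the middle axis the offset index, flattening in row-major order yields the Start_Diff_0, End_Diff_0, Start_Diff_1, ... column order.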
Finally, generate the column names and add the result to the original with DataFrame.join:
cols = [item for x in range(b.shape[0]) for item in (f'Start_Diff_{x}', f'End_Diff_{x}')]
df = df1.join(pd.DataFrame(c, columns=cols, index=df1.index))
print (df)
  Name  Start  End  Start_Diff_0  End_Diff_0  Start_Diff_1  End_Diff_1  \
0    A     10   20             5          10            -5           0
1    B     20   30            15          20             5          10
2    C     30   40            25          30            15          20

   Start_Diff_2  End_Diff_2
0           -15         -10
1            -5           0
2             5          10
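For reuse, the three steps can be wrapped in one function. This is a minimal sketch, not part of the original answer: the function name `add_offset_diffs` is hypothetical, and it uses the `a[:, None, :] - b` form of the broadcast, which should give the same result as the `swapaxes` version above.

```python
import numpy as np
import pandas as pd


def add_offset_diffs(df1, df2):
    """Return df1 with a Start/End difference column pair for every row of df2.

    Hypothetical helper; assumes df1 has 'Start' and 'End' columns and
    df2 has default integer column labels 0 and 1.
    """
    a = df1[['Start', 'End']].to_numpy()          # shape (n, 2)
    b = df2[[0, 1]].to_numpy()                    # shape (m, 2)
    c = (a[:, None, :] - b).reshape(len(a), -1)   # shape (n, 2*m)
    cols = [name for k in range(len(b))
            for name in (f'Start_Diff_{k}', f'End_Diff_{k}')]
    return df1.join(pd.DataFrame(c, columns=cols, index=df1.index))


df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]],
                   columns=['Name', 'Start', 'End'])
df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]])

result = add_offset_diffs(df1, df2)

# Cross-check one column against plain column arithmetic
check = (result['Start_Diff_1'] == df1['Start'] - df2.iloc[1, 0]).all()
print(result.shape, check)
```

The intermediate array holds len(df1) × 2·len(df2) values, so with millions of rows and many offset rows, memory may become the limiting factor before speed does.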