Why is R's data.table so much faster than pandas?
Question
I have a 12-million-row dataset, with 3 columns as unique identifiers and another 2 columns with values. I'm trying to do a rather simple task:
- Group by the three identifiers. This yields about 2.6 million unique combinations.
- Task 1: calculate the median for column Val1.
- Task 2: calculate the mean for column Val1 given some condition on Val2.
Here are my results, using pandas and data.table (both the latest versions at the moment, on the same machine):
+-----------------+----------------+------------+
|                 | pandas         | data.table |
+-----------------+----------------+------------+
| TASK 1          | 150 seconds    | 4 seconds  |
| TASK 1 + TASK 2 | doesn't finish | 5 seconds  |
+-----------------+----------------+------------+
I think I may be doing something wrong with pandas: converting Grp1 and Grp2 to categories didn't help much, nor did switching between .agg and .apply. Any ideas?
Below is the reproducible code.
Dataframe generation:
import numpy as np
import pandas as pd
from collections import OrderedDict
import time
np.random.seed(123)
list1 = list(pd.util.testing.rands_array(10, 750))
list2 = list(pd.util.testing.rands_array(10, 700))
list3 = list(np.random.randint(100000,200000,5))
N = 12 * 10**6 # please make sure you have enough RAM
df = pd.DataFrame({'Grp1': np.random.choice(list1, N, replace = True),
                   'Grp2': np.random.choice(list2, N, replace = True),
                   'Grp3': np.random.choice(list3, N, replace = True),
                   'Val1': np.random.randint(0,100,N),
                   'Val2': np.random.randint(0,10,N)})
# this works and shows there are 2,625,000 unique combinations
df_test = df.groupby(['Grp1','Grp2','Grp3']).size()
print(df_test.shape[0]) # 2,625,000 rows
# export to feather so that same df goes into R
df.to_feather('file.feather')
Task 1 in Python:
# TASK 1: 150 seconds (sorted / not sorted doesn't seem to matter)
df.sort_values(['Grp1','Grp2','Grp3'], inplace = True)
t0 = time.time()
df_agg1 = df.groupby(['Grp1','Grp2','Grp3']).agg({'Val1':[np.median]})
t1 = time.time()
print("Duration for complex: %s seconds ---" % (t1 - t0))
Task 1 + Task 2 in Python:
# TASK 1 + TASK 2: this kept running for 10 minutes to no avail
# (sorted / not sorted doesn't seem to matter)
def f(x):
    d = OrderedDict()
    d['Median_all'] = np.median(x['Val1'])
    d['Median_lt_5'] = np.median(x['Val1'][x['Val2'] < 5])
    return pd.Series(d)
t0 = time.time()
df_agg2 = df.groupby(['Grp1','Grp2','Grp3']).apply(f)
t1 = time.time()
print("Duration for complex: %s seconds ---" % (t1 - t0)) # didn't complete
Equivalent R code:
library(data.table)
library(feather)
DT = setDT(feater("file.feather"))
system.time({
  DT_agg <- DT[, .(Median_all = median(Val1),
                   Median_lt_5 = median(Val1[Val2 < 5])), by = c('Grp1','Grp2','Grp3')]
}) # 5 seconds
Recommended answer
I can't reproduce your R results. I fixed the typo where you misspelled feather, but I get the following error:
Error in `[.data.table`(DT, , .(Median_all = median(Val1), Median_lt_5 = median(Val1[Val2 < :
column or expression 1 of 'by' or 'keyby' is type NULL. Do not quote column names. Usage: DT[,sum(colC),by=list(colA,month(colB))]
As for the Python example, if you want to get the median for each group where Val2 is less than 5, then you should filter first, as in:
df[df.Val2 < 5].groupby(['Grp1','Grp2','Grp3'])['Val2'].median()
This completes in under 8 seconds on my MacBook Pro.
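For illustration, the filter-first idea can be extended to produce both columns that the question's function f builds, without calling .apply. This is only a sketch under assumptions: it takes the goal to be the median of Val1 (as in f, whereas the one-liner above selects Val2), it uses a small synthetic frame rather than the 12-million-row one, and the names keys, median_all, median_lt_5 and result are made up for the example. No timing is claimed for it.

```python
import numpy as np
import pandas as pd

# Small synthetic frame with the question's column names (sizes are illustrative only).
rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "Grp1": rng.choice(["a", "b", "c"], n),
    "Grp2": rng.choice(["x", "y"], n),
    "Grp3": rng.integers(0, 3, n),
    "Val1": rng.integers(0, 100, n),
    "Val2": rng.integers(0, 10, n),
})

keys = ["Grp1", "Grp2", "Grp3"]

# Median of Val1 over all rows in each group, via the built-in groupby median.
median_all = df.groupby(keys)["Val1"].median().rename("Median_all")

# Filter first, then group: median of Val1 over the rows where Val2 < 5.
median_lt_5 = (
    df[df["Val2"] < 5].groupby(keys)["Val1"].median().rename("Median_lt_5")
)

# Align the two results on the group index; a group with no Val2 < 5 rows gets NaN.
result = pd.concat([median_all, median_lt_5], axis=1)
print(result.head())
```

The two built-in median calls replace the per-group Python function, and pd.concat with axis=1 lines the results up by group key.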