Why is group_by -> filter -> summarise faster in R than pandas?

Problem description

I am converting some of our older code from R to Python. In the process, I have found pandas to be a bit slower than R. I'd like to know whether I am doing anything wrong.

R code (takes around 2 ms on my system):

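library(dplyr)  # provides %>%, group_by(), summarise(), ungroup()

df = data.frame(col_a = sample(letters[1:3], 20, TRUE),
                col_b = sample(1:2, 20, TRUE),
                col_c = sample(letters[1:2], 20, TRUE),
                col_d = sample(c(4, 2), 20, TRUE))

microbenchmark::microbenchmark(
  a = df %>%
    group_by(col_a, col_b) %>%
    summarise(
      a = sum(col_c == 'a'),
      b = sum(col_c == 'b'),
      c = a / b  # a and b are already usable inside the same summarise() call
    ) %>%
    ungroup()
)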

pandas (takes around 10 ms on my system):

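import numpy as np
import pandas as pd

N = 20  # assumed: N is not defined in the original; 20 matches the R example

df = pd.DataFrame({
    'col_a': np.random.choice(['a', 'b', 'c'], N),
    'col_b': np.random.choice([1, 2], N),
    'col_c': np.random.choice(['a', 'b'], N),
    'col_d': np.random.choice(['4', '2'], N),
})

# Timed in its own IPython/Jupyter cell (%%timeit must be the first line of a cell)
%%timeit
df1 = df.groupby(['col_a', 'col_b']).agg({
    'col_c': [
        ('a', lambda x: (x == 'a').sum()),
        ('b', lambda x: (x == 'b').sum())
    ]}).reset_index()
df1['rat'] = df1.col_c.a / df1.col_c.b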

Recommended answer

This isn't a technical answer, but it's worth noting that there are many different ways to accomplish this operation in pandas, and some are faster than others. For example, the pandas code below gets the values you're looking for (albeit with some ugly MultiIndex columns) in about 5 ms:

df.groupby(['col_a', 'col_b', 'col_c'])\
  .count()\
  .unstack()\
  .assign(rat = lambda x: x.col_d.a/x.col_d.b)

4.96 ms ± 169 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
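The MultiIndex columns come from calling count() on the remaining column (col_d) before unstack(). One possible way to avoid them, sketched here under assumptions and not timed, is to use size() instead, which returns a single Series, so unstack() produces plain columns named after the values of col_c. The seeded generator, N = 20, and the names counts and flat are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Assumed sample data: same shape as the question's frame, seeded for repeatability.
rng = np.random.default_rng(0)
N = 20
df = pd.DataFrame({
    'col_a': rng.choice(['a', 'b', 'c'], N),
    'col_b': rng.choice([1, 2], N),
    'col_c': rng.choice(['a', 'b'], N),
    'col_d': rng.choice(['4', '2'], N),
})

# size() counts rows per (col_a, col_b, col_c) group and returns a Series,
# so unstack() yields single-level columns instead of a MultiIndex.
counts = (
    df.groupby(['col_a', 'col_b', 'col_c'])
      .size()
      .unstack(fill_value=0)
      .reindex(columns=['a', 'b'], fill_value=0)  # guard against a missing category
)

# Compute the ratio, then turn the group keys back into ordinary columns.
flat = counts.assign(rat=lambda x: x['a'] / x['b']).reset_index()
flat.columns.name = None  # drop the leftover 'col_c' label on the column axis

print(flat)
```

Groups where b is 0 produce inf or NaN in rat rather than raising an error, so the result may need an explicit check depending on the data.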

Aside from any under-the-hood speedups, I think the main speed advantage of tidyverse syntax over pandas here is that summarise() makes each new variable immediately available within the same call, which avoids having to spread the counts and then compute rat.

If there's an analog to that in pandas, I don't know of it. The closest thing is either pipe() or a lambda within assign(). Each new function call in the chain takes time to execute, so pandas ends up being slower.
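As a rough illustration of the "lambda within assign()" pattern mentioned above, the sketch below precomputes boolean indicator columns, sums them per group with named aggregation, and then derives the ratio from the freshly created columns in a final assign(). This is one possible arrangement, not the answer's own code, and it has not been benchmarked here; the seeded data, N = 20, and the names is_a, is_b, and result are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Assumed sample data, seeded for repeatability.
rng = np.random.default_rng(0)
N = 20
df = pd.DataFrame({
    'col_a': rng.choice(['a', 'b', 'c'], N),
    'col_b': rng.choice([1, 2], N),
    'col_c': rng.choice(['a', 'b'], N),
})

result = (
    # Vectorized comparisons, done once for the whole frame rather than per group.
    df.assign(is_a=df['col_c'].eq('a'), is_b=df['col_c'].eq('b'))
      .groupby(['col_a', 'col_b'], as_index=False)
      # Named aggregation: output column = (input column, aggregation).
      .agg(a=('is_a', 'sum'), b=('is_b', 'sum'))
      # The lambda receives the aggregated frame, so a and b are already available.
      .assign(rat=lambda d: d['a'] / d['b'])
)

print(result)
```

The string aggregation 'sum' lets pandas use its built-in grouped sum instead of calling a Python lambda once per group, which is where much of the per-call overhead in the original version is likely to come from.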
