Why were pandas merges in python faster than data.table merges in R in 2012?


Question


I recently came across the pandas library for python, which according to this benchmark performs very fast in-memory merges. It's even faster than the data.table package in R (my language of choice for analysis).


Why is pandas so much faster than data.table? Is it because of an inherent speed advantage python has over R, or is there some tradeoff I'm not aware of? Is there a way to perform inner and outer joins in data.table without resorting to merge(X, Y, all=FALSE) and merge(X, Y, all=TRUE)?
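For context, the kind of joins being compared can be sketched in pandas; the frames and column names below are illustrative, not the benchmark's own data:

```python
import pandas as pd

# Two small tables sharing a key column; 'key', 'lhs', 'rhs' are
# made-up names for illustration.
X = pd.DataFrame({"key": ["a", "b", "c"], "lhs": [1, 2, 3]})
Y = pd.DataFrame({"key": ["b", "c", "d"], "rhs": [10, 20, 30]})

inner = pd.merge(X, Y, on="key", how="inner")  # keys present in both tables
outer = pd.merge(X, Y, on="key", how="outer")  # union of keys, NaN-filled
print(inner)
print(outer)
```

In data.table the same pair corresponds to `merge(X, Y, all=FALSE)` and `merge(X, Y, all=TRUE)`, which is exactly what the question asks about avoiding.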

Here is the R code and the Python code used to benchmark the various packages.

Answer


It looks like Wes may have discovered a known issue in data.table when the number of unique strings (levels) is large: 10,000.


Does Rprof() reveal most of the time spent in the call sortedmatch(levels(i[[lc]]), levels(x[[rc]]))? This isn't really the join itself (the algorithm), but a preliminary step.


Recent efforts have gone into allowing character columns in keys, which should resolve that issue by integrating more closely with R's own global string hash table. Some benchmark results are already reported by test.data.table() but that code isn't hooked up yet to replace the levels to levels match.


Are pandas merges faster than data.table for regular integer columns? That should be a way to isolate the algorithm itself vs factor issues.
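That isolation can be run on the pandas side by timing the same merge with integer keys and with the equivalent string keys; the sizes and names below are arbitrary, not the benchmark's:

```python
import time
import numpy as np
import pandas as pd

n = 200_000
rng = np.random.default_rng(0)
int_keys = rng.integers(0, 10_000, size=n)

# Same key values, once as integers and once as strings.
left_int = pd.DataFrame({"key": int_keys, "v": np.arange(n)})
right_int = pd.DataFrame({"key": np.arange(10_000), "w": np.arange(10_000)})
left_str = left_int.assign(key=left_int["key"].astype(str))
right_str = right_int.assign(key=right_int["key"].astype(str))

t0 = time.perf_counter()
m_int = left_int.merge(right_int, on="key")
t1 = time.perf_counter()
m_str = left_str.merge(right_str, on="key")
t2 = time.perf_counter()
print(f"int keys: {t1 - t0:.3f}s, str keys: {t2 - t1:.3f}s")
```

If the gap between the two timings is large, the cost is in string handling rather than in the join algorithm itself.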


Also, data.table has time series merge in mind. Two aspects to that: i) multi column ordered keys such as (id,datetime) ii) fast prevailing join (roll=TRUE) a.k.a. last observation carried forward.
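In modern pandas, pd.merge_asof() gives a comparable "last observation carried forward" join; a minimal sketch with made-up quote/trade data (both frames must be sorted on the join column):

```python
import pandas as pd

# Irregular quotes and trades; merge_asof rolls each trade back to the
# most recent quote at or before its timestamp (LOCF).
quotes = pd.DataFrame({"time": [1, 3, 7], "price": [100.0, 101.0, 99.5]})
trades = pd.DataFrame({"time": [2, 3, 8], "qty": [5, 2, 7]})

rolled = pd.merge_asof(trades, quotes, on="time")
print(rolled)
```

This is the same effect as data.table's `X[Y, roll=TRUE]` on an ordered (time) key, though merge_asof() did not yet exist when this answer was written.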


I'll need some time to confirm as it's the first I've seen of the comparison to data.table as presented.

UPDATE from data.table v1.8.0, released July 2012

  • Internal function sortedmatch() removed and replaced with chmatch() when matching i levels to x levels for columns of type 'factor'. This preliminary step was causing a (known) significant slowdown when the number of levels of a factor column was large (e.g. >10,000). Exacerbated in tests of joining four such columns, as demonstrated by Wes McKinney (author of Python package Pandas). Matching 1 million strings, of which 600,000 are unique, is now reduced from 16s to 0.5s, for example.

Also:


  • character columns are now allowed in keys and are preferred to factor. data.table() and setkey() no longer coerce character to factor. Factors are still supported. Implements FR#1493, FR#1224 and (partially) FR#951.


  • New functions chmatch() and %chin%, faster versions of match() and %in% for character vectors. R's internal string cache is utilised (no hash table is built). They are about 4 times faster than match() on the example in ?chmatch.
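R's global string cache amounts to string interning: equal strings share one cached object, so matching reduces to pointer comparison instead of character-by-character comparison. A rough Python analogue of the idea (not data.table's actual mechanism) using sys.intern:

```python
import sys

# Build equal strings at runtime (so CPython does not fold them into a
# single constant), then intern both into the shared string table.
a = sys.intern("".join(["data", ".table"]))
b = sys.intern("".join(["data", ".table"]))

# After interning, equality can be decided by object identity alone.
print(a is b)
```

chmatch() exploits the same property of R's cache, which is why no hash table needs to be built at join time.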


As of Sep 2013 data.table is v1.8.10 on CRAN and we're working on v1.9.0. NEWS is updated live.


But as I wrote originally, above:


data.table has time series merge in mind. Two aspects to that: i) multi column ordered keys such as (id,datetime) ii) fast prevailing join (roll=TRUE) a.k.a. last observation carried forward.


So the pandas equi join of two character columns is probably still faster than data.table, since it sounds like pandas hashes the combined two columns. data.table doesn't hash the key because it has prevailing ordered joins in mind. A "key" in data.table is literally just the sort order (similar to a clustered index in SQL; i.e., that's how the data is ordered in RAM). Adding secondary keys, for example, is on the to-do list.
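The practical difference between a hashed key and a sort-order key can be sketched in Python: when the table is kept sorted on its key, an equi-join probe is a binary search and no hash table is ever built. The names and data below are illustrative:

```python
import bisect

# A data.table "key" is the physical sort order, like a clustered index.
right_keys = [3, 7, 7, 12, 19]          # sorted key column
right_vals = ["a", "b", "c", "d", "e"]  # payload column

def sorted_lookup(key):
    """Return payload values matching `key` via binary search (equi-join probe)."""
    lo = bisect.bisect_left(right_keys, key)
    hi = bisect.bisect_right(right_keys, key)
    return right_vals[lo:hi]

print(sorted_lookup(7))   # matches the two rows keyed 7
print(sorted_lookup(5))   # no match
```

The same ordered layout is what makes rolling (prevailing) joins cheap: the "most recent" row is simply the one just before the binary-search insertion point.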


In summary, the glaring speed difference highlighted by this particular two-character-column test with over 10,000 unique strings shouldn't be as bad now, since the known problem has been fixed.

