Why are pandas merges in python faster than data.table merges in R?
Problem description
I recently came across the pandas library for python, which according to this benchmark performs very fast in-memory merges. It's even faster than the data.table package in R (my language of choice for analysis).

Why is pandas so much faster than data.table? Is it because of an inherent speed advantage python has over R, or is there some tradeoff I'm not aware of? Is there a way to perform inner and outer joins in data.table without resorting to merge(X, Y, all=FALSE) and merge(X, Y, all=TRUE)?

Here's the R code and the Python code used to benchmark the various packages.

Answer

It looks like Wes may have discovered a known issue in data.table when the number of unique strings (levels) is large: 10,000.

Does Rprof() reveal most of the time spent in the call sortedmatch(levels(i[[lc]]), levels(x[[rc]]))? This isn't really the join itself (the algorithm), but a preliminary step.

Recent efforts have gone into allowing character columns in keys, which should resolve that issue by integrating more closely with R's own global string hash table. Some benchmark results are already reported by test.data.table(), but that code isn't hooked up yet to replace the levels-to-levels match.

Are pandas merges faster than data.table for regular integer columns? That should be a way to isolate the algorithm itself vs factor issues.

Also, data.table has time series merge in mind. Two aspects to that: i) multi-column ordered keys such as (id, datetime); ii) fast prevailing join (roll=TRUE) a.k.a. last observation carried forward.

I'll need some time to confirm, as it's the first I've seen of the comparison to data.table as presented.

UPDATE from data.table v1.8.0, released July 2012:

- Internal function sortedmatch() removed and replaced with chmatch() when matching i levels to x levels for columns of type factor. This preliminary step was causing a (known) significant slowdown when the number of levels of a factor column was large (e.g. >10,000), exacerbated in tests joining four such columns, as demonstrated by Wes McKinney (author of the Python package Pandas). Matching 1 million strings, of which 600,000 are unique, is now reduced from 16s to 0.5s.

Also in that release:

- Character columns are now allowed in keys and are preferred to factor. data.table() and setkey() no longer coerce character to factor. Factors are still supported. Implements FR#1493, FR#1224 and (partially) FR#951.

- New functions chmatch() and %chin%, faster versions of match() and %in% for character vectors. R's internal string cache is utilised (no hash table is built). They are about 4 times faster than match() on the example in ?chmatch.

As of Sep 2013, data.table is v1.8.10 on CRAN and we're working on v1.9.0. NEWS is updated live.

But as I wrote originally, above:

data.table has time series merge in mind. Two aspects to that: i) multi-column ordered keys such as (id, datetime); ii) fast prevailing join (roll=TRUE) a.k.a. last observation carried forward.

So the Pandas equi join of two character columns is probably still faster than data.table, since it sounds like it hashes the combined two columns. data.table doesn't hash the key because it has prevailing ordered joins in mind. A "key" in data.table is literally just the sort order (similar to a clustered index in SQL; i.e., that's how the data is ordered in RAM). On the list is to add secondary keys, for example.

In summary, the glaring speed difference highlighted by this particular two-character-column test with over 10,000 unique strings shouldn't be as bad now, since the known problem has been fixed.
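To make the API contrast above concrete, here is a minimal pandas sketch with made-up toy data (not the benchmark data): how="inner" and how="outer" correspond to R's merge(X, Y, all=FALSE) and merge(X, Y, all=TRUE), and pd.merge_asof, added in much later pandas versions, is the closest pandas analogue of data.table's roll=TRUE prevailing join (last observation carried forward on an ordered key).

```python
import pandas as pd

# Toy frames standing in for the benchmarked tables (illustrative only).
left = pd.DataFrame({"key": ["a", "b", "c", "d"], "x": [1, 2, 3, 4]})
right = pd.DataFrame({"key": ["b", "c", "e"], "y": [20, 30, 50]})

# pandas equivalents of R's merge(X, Y, all=FALSE) and merge(X, Y, all=TRUE):
inner = left.merge(right, on="key", how="inner")   # all=FALSE: keys b, c
outer = left.merge(right, on="key", how="outer")   # all=TRUE: keys a..e, NaN where absent

# Closest analogue of data.table's roll=TRUE prevailing join: merge_asof
# on a sorted key, matching each left row to the most recent right row.
trades = pd.DataFrame({"time": [1, 5, 10], "trade": ["t1", "t2", "t3"]})
quotes = pd.DataFrame({"time": [0, 4, 9], "quote": [100.0, 101.0, 102.0]})
rolled = pd.merge_asof(trades, quotes, on="time")

print(inner)
print(outer)
print(rolled)
```

Note that merge() builds a hash join on the key values (which is why combined string keys matter), whereas merge_asof, like a data.table keyed join, relies on both inputs already being sorted on the join column.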