Why are pandas merges in python faster than data.table merges in R?


Problem Description



I recently came across the pandas library for python, which according to this benchmark performs very fast in-memory merges. It's even faster than the data.table package in R (my language of choice for analysis).

Why is pandas so much faster than data.table? Is it because of an inherent speed advantage python has over R, or is there some tradeoff I'm not aware of? Is there a way to perform inner and outer joins in data.table without resorting to merge(X, Y, all=FALSE) and merge(X, Y, all=TRUE)?

Here's the R code and the Python code used to benchmark the various packages.
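For reference, a minimal sketch of the data.table join idioms in question, using hypothetical toy tables (the bracket form X[Y] is data.table's standard alternative to merge()):

    library(data.table)

    # Hypothetical toy tables keyed on a shared column.
    X <- data.table(id = c(1L, 2L, 3L), x = c("a", "b", "c"))
    Y <- data.table(id = c(2L, 3L, 4L), y = c(10, 20, 30))
    setkey(X, id)
    setkey(Y, id)

    X[Y]                     # right outer join: every row of Y, matched against X
    X[Y, nomatch = 0]        # inner join: only ids present in both tables
    Y[X]                     # the mirror-image outer join: every row of X
    merge(X, Y, all = TRUE)  # a full outer join still goes through merge()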

Solution

It looks like Wes may have discovered a known issue in data.table when the number of unique strings (levels) is large: 10,000.

Does Rprof() reveal most of the time spent in the call sortedmatch(levels(i[[lc]]), levels(x[[rc]]))? This isn't really the join itself (the algorithm), but a preliminary step.
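A sketch of that profiling step, assuming the hypothetical keyed tables X and Y from the sketch above: wrap the join in Rprof() and check whether sortedmatch dominates.

    Rprof("join.out")
    ans <- X[Y]                          # the keyed join being profiled
    Rprof(NULL)
    summaryRprof("join.out")$by.total    # sortedmatch near the top => the preliminary step, not the join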

Recent efforts have gone into allowing character columns in keys, which should resolve that issue by integrating more closely with R's own global string hash table. Some benchmark results are already reported by test.data.table(), but that code isn't hooked up yet to replace the levels-to-levels match.

Are pandas merges faster than data.table for regular integer columns? That should be a way to isolate the algorithm itself vs factor issues.
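A hedged sketch of such an isolation test, with hypothetical sizes: join on a plain integer key, so factor-level matching cannot be the bottleneck and any time measured is the join algorithm itself.

    library(data.table)
    set.seed(1)
    N <- 1e6

    # Integer keys only: no factor levels, no string matching.
    big   <- data.table(id = sample.int(1e4, N, replace = TRUE), v = rnorm(N))
    small <- data.table(id = 1:1e4, w = rnorm(1e4))
    setkey(big, id)
    setkey(small, id)
    system.time(ans <- big[small])   # binary-search join on the sorted integer key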

Also, data.table has time series merge in mind. Two aspects to that: i) multi column ordered keys such as (id,datetime) ii) fast prevailing join (roll=TRUE) a.k.a. last observation carried forward.
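A minimal sketch of both aspects, with hypothetical price/trade data: key on (id, datetime), then roll the last observation forward.

    library(data.table)

    # Hypothetical data: a price feed and some later trades.
    prices <- data.table(
      id       = c("A", "A", "B"),
      datetime = as.POSIXct(c("2012-01-01 09:00", "2012-01-01 10:00",
                              "2012-01-01 09:30")),
      price    = c(100, 101, 50))
    trades <- data.table(
      id       = c("A", "B"),
      datetime = as.POSIXct(c("2012-01-01 10:30", "2012-01-01 09:45")),
      qty      = c(5, 7))

    setkey(prices, id, datetime)   # multi-column ordered key
    setkey(trades, id, datetime)

    prices[trades, roll = TRUE]    # for each trade, carry the most recent price forward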

I'll need some time to confirm, as it's the first time I've seen this comparison to data.table as presented.


UPDATE from data.table v1.8.0 released July 2012

  • Internal function sortedmatch() removed and replaced with chmatch() when matching i levels to x levels for columns of type 'factor'. This preliminary step was causing a (known) significant slowdown when the number of levels of a factor column was large (e.g. >10,000). Exacerbated in tests of joining four such columns, as demonstrated by Wes McKinney (author of Python package Pandas). Matching 1 million strings, of which 600,000 are unique, is now reduced from 16s to 0.5s, for example.

Also in that release were:

  • character columns are now allowed in keys and are preferred to factor. data.table() and setkey() no longer coerce character to factor. Factors are still supported. Implements FR#1493, FR#1224 and (partially) FR#951.

  • New functions chmatch() and %chin%, faster versions of match() and %in% for character vectors. R's internal string cache is utilised (no hash table is built). They are about 4 times faster than match() on the example in ?chmatch.
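A small usage sketch of those functions, on hypothetical vectors sized like the example above (1 million strings, ~600,000 unique):

    library(data.table)
    set.seed(42)

    tbl <- as.character(sample.int(6e5))       # 600,000 unique strings
    x   <- sample(tbl, 1e6, replace = TRUE)    # 1 million lookups

    system.time(a <- match(x, tbl))            # base R
    system.time(b <- chmatch(x, tbl))          # data.table; uses R's string cache
    identical(a, b)                            # same answer, faster

    head(x %chin% tbl)                         # character-optimised %in%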

As of Sep 2013 data.table is v1.8.10 on CRAN and we're working on v1.9.0. NEWS is updated live.


But as I wrote originally, above:

data.table has time series merge in mind. Two aspects to that: i) multi column ordered keys such as (id,datetime) ii) fast prevailing join (roll=TRUE) a.k.a. last observation carried forward.

So the Pandas equi join of two character columns is probably still faster than data.table, since it sounds like pandas hashes the combined two columns. data.table doesn't hash the key because it has prevailing ordered joins in mind. A "key" in data.table is literally just the sort order (similar to a clustered index in SQL; i.e., that's how the data is ordered in RAM). Adding secondary keys, for example, is on the to-do list.
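To make that concrete, a tiny sketch (hypothetical data) of a key being nothing more than the physical sort order:

    library(data.table)

    DT <- data.table(id = c("b", "a", "c"), v = 1:3)
    setkey(DT, id)   # physically reorders the rows in RAM, like a clustered index
    DT               # rows now stored in id order: a, b, c
    key(DT)          # "id"
    DT["b"]          # keyed lookup: binary search on the sorted column, no hash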

In summary, the glaring speed difference highlighted by this particular two-character-column test with over 10,000 unique strings shouldn't be as bad now, since the known problem has been fixed.
