在 R 中合并/加入 data.frames 的最快方法是什么? [英] What's the fastest way to merge/join data.frames in R?

查看:24
本文介绍了在 R 中合并/加入 data.frames 的最快方法是什么?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

例如(不确定是否最具代表性的例子):

For example (not sure if most representative example though):

N <- 1e6
d1 <- data.frame(x=sample(N,N), y1=rnorm(N))
d2 <- data.frame(x=sample(N,N), y2=rnorm(N))

这是我目前得到的:

d <- merge(d1,d2)
# 7.6 sec

library(plyr)
d <- join(d1,d2)
# 2.9 sec

library(data.table)
dt1 <- data.table(d1, key="x")
dt2 <- data.table(d2, key="x")
d <- data.frame( dt1[dt2,list(x,y1,y2=dt2$y2)] )
# 4.9 sec

library(sqldf)
sqldf()
sqldf("create index ix1 on d1(x)")
sqldf("create index ix2 on d2(x)")
d <- sqldf("select * from d1 inner join d2 on d1.x=d2.x")
sqldf()
# 17.4 sec

推荐答案

当第一个数据帧中的每个键值在第二个数据帧中都有一个唯一键时,匹配方法有效.如果第二个数据框中存在重复项,则匹配和合并方法不同.当然,Match 更快,因为它没有做那么多.特别是它从不寻找重复的键.(代码后续)

The match approach works when there is a unique key in the second data frame for each key value in the first. If there are duplicates in the second data frame then the match and merge approaches are not the same. Match is, of course, faster since it is not doing as much. In particular it never looks for duplicate keys. (continued after code)

DF1 = data.frame(a = c(1, 1, 2, 2), b = 1:4)
DF2 = data.frame(b = c(1, 2, 3, 3, 4), c = letters[1:5])
merge(DF1, DF2)
    b a c
  1 1 1 a
  2 2 1 b
  3 3 2 c
  4 3 2 d
  5 4 2 e
DF1$c = DF2$c[match(DF1$b, DF2$b)]
DF1$c
[1] a b c e
Levels: a b c d e

> DF1
  a b c
1 1 1 a
2 1 2 b
3 2 3 c
4 2 4 e

在问题中发布的 sqldf 代码中,似乎在两个表上使用了索引,但实际上,它们被放置在在 sql select 运行之前被覆盖的表上,并且部分地,解释了为什么它这么慢.sqldf 的想法是 R 会话中的数据框构成数据库,而不是 sqlite 中的表.因此,每次代码引用一个不合格的表名时,它都会在您的 R 工作区中查找它——而不是在 sqlite 的主数据库中.因此,显示的 select 语句将 d1 和 d2 从工作区读取到 sqlite 的主数据库中,破坏了那些带有索引的数据库.结果,它执行了一个没有索引的连接.如果您想使用sqlite 主数据库中的d1 和d2 版本,您必须将它们称为main.d1 和main.d2 而不是d1 和d2.另外,如果您试图让它运行得尽可能快,那么请注意简单连接不能在两个表上都使用索引,因此您可以节省创建其中一个索引的时间.在下面的代码中,我们说明了这些要点.

In the sqldf code that was posted in the question, it might appear that indexes were used on the two tables but, in fact, they are placed on tables which were overwritten before the sql select ever runs and that, in part, accounts for why its so slow. The idea of sqldf is that the data frames in your R session constitute the data base, not the tables in sqlite. Thus each time the code refers to an unqualified table name it will look in your R workspace for it -- not in sqlite's main database. Thus the select statement that was shown reads d1 and d2 from the workspace into sqlite's main database clobbering the ones that were there with the indexes. As a result it does a join with no indexes. If you wanted to make use of the versions of d1 and d2 that were in sqlite's main database you would have to refer to them as main.d1 and main.d2 and not as d1 and d2. Also, if you are trying to make it run as fast as possible then note that a simple join can't make use of indexes on both tables so you can save the time of creating one of the indexes. In the code below we illustrate these points.

值得注意的是,精确的计算会对哪个包最快产生巨大的影响.例如,我们在下面进行合并和聚合.我们看到两者的结果几乎相反.在第一个示例中,从最快到最慢,我们得到:data.table、plyr、merge 和 sqldf 而在第二个示例中 sqldf、aggregate、data.table 和 plyr——几乎与第一个相反.在第一个示例中,sqldf 比 data.table 慢 3 倍,在第二个示例中,它比 plyr 快 200 倍,比 data.table 快 100 倍.下面我们显示输入代码、合并的输出时间和聚合的输出时间.还值得注意的是,sqldf 基于数据库,因此可以处理比 R 可以处理的对象更大的对象(如果您使用 sqldf 的 dbname 参数),而其他方法仅限于在主内存中处理.我们也用 sqlite 说明了 sqldf 但它也支持 H2 和 PostgreSQL 数据库.

Its worthwhile to notice that the precise computation can make a huge difference on which package is fastest. For example, we do a merge and an aggregate below. We see that the results are nearly reversed for the two. In the first example from fastest to slowest we get: data.table, plyr, merge and sqldf whereas in the second example sqldf, aggregate, data.table and plyr -- nearly the reverse of the first one. In the first example sqldf is 3x slower than data.table and in the second its 200x faster than plyr and 100 times faster than data.table. Below we show the input code, the output timings for the merge and the output timings for the aggregate. Its also worthwhile noting that sqldf is based on a database and therefore can handle objects larger than R can handle (if you use the dbname argument of sqldf) while the other approaches are limited to processing in main memory. Also we have illustrated sqldf with sqlite but it also supports the H2 and PostgreSQL databases as well.

library(plyr)
library(data.table)
library(sqldf)

set.seed(123)
N <- 1e5
d1 <- data.frame(x=sample(N,N), y1=rnorm(N))
d2 <- data.frame(x=sample(N,N), y2=rnorm(N))

g1 <- sample(1:1000, N, replace = TRUE)
g2<- sample(1:1000, N, replace = TRUE)
d <- data.frame(d1, g1, g2)

library(rbenchmark)

benchmark(replications = 1, order = "elapsed",
   merge = merge(d1, d2),
   plyr = join(d1, d2),
   data.table = { 
      dt1 <- data.table(d1, key = "x")
      dt2 <- data.table(d2, key = "x")
      data.frame( dt1[dt2,list(x,y1,y2=dt2$y2)] )
      },
   sqldf = sqldf(c("create index ix1 on d1(x)",
      "select * from main.d1 join d2 using(x)"))
)

set.seed(123)
N <- 1e5
g1 <- sample(1:1000, N, replace = TRUE)
g2<- sample(1:1000, N, replace = TRUE)
d <- data.frame(x=sample(N,N), y=rnorm(N), g1, g2)

benchmark(replications = 1, order = "elapsed",
   aggregate = aggregate(d[c("x", "y")], d[c("g1", "g2")], mean), 
   data.table = {
      dt <- data.table(d, key = "g1,g2")
      dt[, colMeans(cbind(x, y)), by = "g1,g2"]
   },
   plyr = ddply(d, .(g1, g2), summarise, avx = mean(x), avy=mean(y)),
   sqldf = sqldf(c("create index ix on d(g1, g2)",
      "select g1, g2, avg(x), avg(y) from main.d group by g1, g2"))
)

比较合并计算的两个基准调用的输出是:

The outputs from the two benchmark call comparing the merge calculations are:

Joining by: x
        test replications elapsed relative user.self sys.self user.child sys.child
3 data.table            1    0.34 1.000000      0.31     0.01         NA        NA
2       plyr            1    0.44 1.294118      0.39     0.02         NA        NA
1      merge            1    1.17 3.441176      1.10     0.04         NA        NA
4      sqldf            1    3.34 9.823529      3.24     0.04         NA        NA

比较聚合计算的基准调用的输出是:

The output from the benchmark call comparing the aggregate calculations are:

        test replications elapsed  relative user.self sys.self user.child sys.child
4      sqldf            1    2.81  1.000000      2.73     0.02         NA        NA
1  aggregate            1   14.89  5.298932     14.89     0.00         NA        NA
2 data.table            1  132.46 47.138790    131.70     0.08         NA        NA
3       plyr            1  212.69 75.690391    211.57     0.56         NA        NA

这篇关于在 R 中合并/加入 data.frames 的最快方法是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆