What's the fastest way to merge/join data.frames in R?

Question

For example (though I'm not sure this is the most representative case):

N <- 1e6
d1 <- data.frame(x=sample(N,N), y1=rnorm(N))
d2 <- data.frame(x=sample(N,N), y2=rnorm(N))

This is what I've got so far:

d <- merge(d1,d2)
# 7.6 sec

library(plyr)
d <- join(d1,d2)
# 2.9 sec

library(data.table)
dt1 <- data.table(d1, key="x")
dt2 <- data.table(d2, key="x")
d <- data.frame( dt1[dt2,list(x,y1,y2=dt2$y2)] )
# 4.9 sec

library(sqldf)
sqldf()
sqldf("create index ix1 on d1(x)")
sqldf("create index ix2 on d2(x)")
d <- sqldf("select * from d1 inner join d2 on d1.x=d2.x")
sqldf()
# 17.4 sec

Answer

The match approach works when there is a unique key in the second data frame for each key value in the first. If there are duplicates in the second data frame then the match and merge approaches are not the same. Match is, of course, faster since it is not doing as much. In particular it never looks for duplicate keys. (continued after code)

DF1 = data.frame(a = c(1, 1, 2, 2), b = 1:4)
DF2 = data.frame(b = c(1, 2, 3, 3, 4), c = letters[1:5])
merge(DF1, DF2)
    b a c
  1 1 1 a
  2 2 1 b
  3 3 2 c
  4 3 2 d
  5 4 2 e
DF1$c = DF2$c[match(DF1$b, DF2$b)]
DF1$c
[1] a b c e
Levels: a b c d e

> DF1
  a b c
1 1 1 a
2 1 2 b
3 2 3 c
4 2 4 e
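
Since match silently keeps only the first hit, it can be worth checking the lookup key for duplicates before relying on it. The following is a minimal sketch in base R; `match_join` is a hypothetical helper name introduced here for illustration, not something from the original code.

```r
# Hypothetical helper (not from the original answer): copy one column from a
# lookup table via match(), refusing to run when the lookup key is not unique.
match_join <- function(left, right, key, value) {
  if (anyDuplicated(right[[key]]) > 0) {
    stop("duplicate keys in lookup table; use merge() instead")
  }
  left[[value]] <- right[[value]][match(left[[key]], right[[key]])]
  left
}

DF1 <- data.frame(a = c(1, 1, 2, 2), b = 1:4)
DF2 <- data.frame(b = c(1, 2, 3, 3, 4), c = letters[1:5])

# DF2$b contains 3 twice, so the guard fires and we capture the message.
res <- tryCatch(match_join(DF1, DF2, "b", "c"),
                error = function(e) conditionMessage(e))

# With the duplicated key dropped, the match() shortcut is safe to use.
DF2u <- DF2[!duplicated(DF2$b), ]
joined <- match_join(DF1, DF2u, "b", "c")
```

With the guard in place, the fast path is only taken when match and merge would return the same rows.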

In the sqldf code that was posted in the question, it might appear that indexes were used on the two tables but, in fact, they were placed on tables that were overwritten before the SQL select ever ran, and that, in part, accounts for why it's so slow. The idea of sqldf is that the data frames in your R session constitute the database, not the tables in sqlite. Thus each time the code refers to an unqualified table name it looks for it in your R workspace, not in sqlite's main database. The select statement that was shown therefore reads d1 and d2 from the workspace into sqlite's main database, clobbering the indexed copies that were already there. As a result it does a join with no indexes. If you want to use the versions of d1 and d2 that are in sqlite's main database, you have to refer to them as main.d1 and main.d2 and not as d1 and d2. Also, if you are trying to make it run as fast as possible, note that a simple join can't make use of indexes on both tables, so you can save the time of creating one of the indexes. In the code below we illustrate these points.
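
As a small standalone sketch of those two points (a single index, and the `main.` qualifier), assuming the sqldf package with its default SQLite backend is installed; the data size is reduced here purely for illustration:

```r
library(sqldf)

set.seed(123)
N <- 1000  # small size, for illustration only
d1 <- data.frame(x = sample(N, N), y1 = rnorm(N))
d2 <- data.frame(x = sample(N, N), y2 = rnorm(N))

# One sqldf() call with two statements: the first uploads d1 and indexes it,
# the second reads that indexed copy through main.d1. The unqualified d2 is
# uploaded from the workspace, and no index is built on it.
d <- sqldf(c("create index ix1 on d1(x)",
             "select * from main.d1 join d2 using(x)"))
```

Because both x columns are permutations of 1:N, the result should have one row per key.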

It's worth noticing that the precise computation can make a huge difference to which package is fastest. For example, we do a merge and an aggregate below, and the results are nearly reversed between the two. In the first example, from fastest to slowest, we get data.table, plyr, merge and sqldf, whereas in the second we get sqldf, aggregate, data.table and plyr. In the timings shown below, sqldf is about 10x slower than data.table in the first example, while in the second it is about 75x faster than plyr and about 47x faster than data.table. Below we show the input code, the output timings for the merge and the output timings for the aggregate.

It's also worth noting that sqldf is based on a database and can therefore handle objects larger than R can handle (if you use the dbname argument of sqldf), while the other approaches are limited to processing in main memory. We have illustrated sqldf with sqlite, but it also supports the H2 and PostgreSQL databases.

library(plyr)
library(data.table)
library(sqldf)

set.seed(123)
N <- 1e5
d1 <- data.frame(x=sample(N,N), y1=rnorm(N))
d2 <- data.frame(x=sample(N,N), y2=rnorm(N))

g1 <- sample(1:1000, N, replace = TRUE)
g2 <- sample(1:1000, N, replace = TRUE)
d <- data.frame(d1, g1, g2)

library(rbenchmark)

benchmark(replications = 1, order = "elapsed",
   merge = merge(d1, d2),
   plyr = join(d1, d2),
   data.table = { 
      dt1 <- data.table(d1, key = "x")
      dt2 <- data.table(d2, key = "x")
      data.frame( dt1[dt2,list(x,y1,y2=dt2$y2)] )
      },
   sqldf = sqldf(c("create index ix1 on d1(x)",
      "select * from main.d1 join d2 using(x)"))
)

set.seed(123)
N <- 1e5
g1 <- sample(1:1000, N, replace = TRUE)
g2 <- sample(1:1000, N, replace = TRUE)
d <- data.frame(x=sample(N,N), y=rnorm(N), g1, g2)

benchmark(replications = 1, order = "elapsed",
   aggregate = aggregate(d[c("x", "y")], d[c("g1", "g2")], mean), 
   data.table = {
      dt <- data.table(d, key = "g1,g2")
      dt[, colMeans(cbind(x, y)), by = "g1,g2"]
   },
   plyr = ddply(d, .(g1, g2), summarise, avx = mean(x), avy=mean(y)),
   sqldf = sqldf(c("create index ix on d(g1, g2)",
      "select g1, g2, avg(x), avg(y) from main.d group by g1, g2"))
)

The output from the benchmark call comparing the merge calculations is:

Joining by: x
        test replications elapsed relative user.self sys.self user.child sys.child
3 data.table            1    0.34 1.000000      0.31     0.01         NA        NA
2       plyr            1    0.44 1.294118      0.39     0.02         NA        NA
1      merge            1    1.17 3.441176      1.10     0.04         NA        NA
4      sqldf            1    3.34 9.823529      3.24     0.04         NA        NA

The output from the benchmark call comparing the aggregate calculations is:

        test replications elapsed  relative user.self sys.self user.child sys.child
4      sqldf            1    2.81  1.000000      2.73     0.02         NA        NA
1  aggregate            1   14.89  5.298932     14.89     0.00         NA        NA
2 data.table            1  132.46 47.138790    131.70     0.08         NA        NA
3       plyr            1  212.69 75.690391    211.57     0.56         NA        NA
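
The data.table timing above reflects the `colMeans(cbind(x, y))` formulation used in the benchmark. A more conventional data.table idiom computes each mean directly inside a grouped call. The sketch below only shows that form and cross-checks it against aggregate on a smaller data set; it assumes the data.table package is installed and makes no claim about how the timings above would change.

```r
library(data.table)

set.seed(123)
N <- 1e4  # reduced size, for illustration only
d <- data.frame(x = sample(N, N), y = rnorm(N),
                g1 = sample(1:100, N, replace = TRUE),
                g2 = sample(1:100, N, replace = TRUE))

dt <- as.data.table(d)

# Both means in one grouped call; by = .(g1, g2) groups on the two columns.
agg_dt <- dt[, .(avx = mean(x), avy = mean(y)), by = .(g1, g2)]

# Base R result for comparison (one row per observed g1/g2 combination).
agg_base <- aggregate(d[c("x", "y")], d[c("g1", "g2")], mean)
```

Timing this variant with the same benchmark call would be the way to see where it lands relative to the other approaches.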
