Combine data.tables and sum the shared column

Views: 45 | Posted: 2020/10/15 19:30:30 | Tags: r, data.table


Question

I have some large data sets and am trying out data.table to combine them while summing the shared column over matching rows. I know how to merge using [ to match rows in the LHS data.table, as shown below with tables a2 (LHS) and a (RHS):

library(data.table)
a2 <- data.table(b = c(letters[1:5], letters[11:15]), c = as.integer(rep(100, 10)))
a <- data.table(b = letters[1:10], c = as.integer(1:10))
setkey(a2, "b")
setkey(a, "b")

a2
    b   c
 1: a 100
 2: b 100
 3: c 100
 4: d 100
 5: e 100
 6: k 100
 7: l 100
 8: m 100
 9: n 100
10: o 100

a
    b  c
 1: a  1
 2: b  2
 3: c  3
 4: d  4
 5: e  5
 6: f  6
 7: g  7
 8: h  8
 9: i  9
10: j 10

From the second answer to "Merge data frames whilst summing common columns in R", I saw how a column can be summed over matching rows, like this:

setkey(a , "b")
setkey(a2, "b")
a2[a, `:=`(c = c + i.c)]
a2
    b   c
 1: a 101
 2: b 102
 3: c 103
 4: d 104
 5: e 105
 6: k 100
 7: l 100
 8: m 100
 9: n 100
10: o 100

However, I am also trying to retain the rows that don't match.

Alternatively, I could use merge as shown below, but I would like to avoid making a new table with 4 columns before reducing it to 2 columns.

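c <- merge(a, a2, by = "b", all = TRUE)                     # full outer join: columns b, c.x, c.y
c <- transform(c, value = rowSums(c[, 2:3], na.rm = TRUE))  # add the row-wise sum as a 4th column
c <- c[, c(1, 4)]                                           # keep only b and value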

c
    b value
 1: a   102
 2: b   104
 3: c   106
 4: d   108
 5: e   110
 6: f     6
 7: g     7
 8: h     8
 9: i     9
10: j    10
11: k   100
12: l   100
13: m   100
14: n   100
15: o   100

This last table is what I would like to achieve. Thanks in advance.

Recommended answer

merge is not likely to be very efficient for the end result you are after. Since both of your data.tables have the same structure, I would suggest rbinding them together and taking the sum by their key. In other words:

rbindlist(list(a, a2))[, sum(c), b]

I've used rbindlist because it is generally more efficient at rbinding data.tables (even though you first have to put your data.tables in a list).
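As a small illustration of the one-liner above on the question's own tables, here is a minimal sketch. The column name value and the use of keyby (to return the result sorted by b) are choices made for this example, not part of the original answer. The tables are rebuilt from scratch, so rows a to e sum to 101 through 105 rather than the 102 through 110 shown in the question, where a2 had already been updated in place by `:=`.

```r
library(data.table)

# Fresh copies of the question's tables (a2 has not been modified by `:=`)
a2 <- data.table(b = c(letters[1:5], letters[11:15]), c = rep(100L, 10))
a  <- data.table(b = letters[1:10], c = 1:10)

# Stack both tables, then sum `c` within each value of `b`.
# `value` is an illustrative column name; `keyby` sorts and keys the result by `b`.
res <- rbindlist(list(a, a2))[, .(value = sum(c)), keyby = b]
res
```

Rows that appear in only one table form a group of size one, so they are kept with their original value; this is what retains the unmatched rows.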

Compare some timings on larger datasets:

library(data.table)
library(stringi)
set.seed(1)
n <- 1e7; n2 <- 1e6
x <- stri_rand_strings(n, 4)
a2 <- data.table(b = sample(x, n2), c = sample(100, n2, TRUE))
a <- data.table(b = sample(x, n2), c = sample(10, n2, TRUE))

system.time(rbindlist(list(a, a2))[, sum(c), b])
#   user  system elapsed 
#   0.83    0.05    0.87 

system.time(merge(a2, a, by = "b", all = TRUE)[, rowSums(.SD, na.rm = TRUE), b]) # Get some coffee
#   user  system elapsed 
# 159.58    0.48  162.95 

## Do we have all the rows we expect to have?
length(unique(c(a$b, a2$b)))
# [1] 1782166

nrow(rbindlist(list(a, a2))[, sum(c), b])
# [1] 1782166
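For readers who would rather stay with the update join from the question, one possible alternative (not from the original answer, and not timed here) is to update the matching rows by reference and then append the rows of a that have no partner in a2, found with a not-join. This sketch assumes b is unique within each table, as it is in the question's small example; the names unmatched and res are illustrative.

```r
library(data.table)

# Fresh copies of the question's small tables
a2 <- data.table(b = c(letters[1:5], letters[11:15]), c = rep(100L, 10))
a  <- data.table(b = letters[1:10], c = 1:10)

# Update join: add a's value to a2 for rows where `b` matches (modifies a2 by reference)
a2[a, on = "b", c := c + i.c]

# Not-join: rows of a whose `b` does not occur in a2
unmatched <- a[!a2, on = "b"]

# Append the unmatched rows and sort by `b`
res <- rbind(a2, unmatched)
setkey(res, b)
res
```

Unlike the rbindlist approach, this does not handle duplicated keys within a table, so it is only a fit when each key appears at most once per table.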
