Efficiently merging large data.tables
Problem description
I have two fairly large data.table objects that I want to merge. dt1 has 500,000,000 observations on 5 columns. dt2 has 300,000 observations on 2 columns. Both objects have the same key, called id.
I want to left_join information from dt2 into dt1.
For example:
dt1 <- data.table(id = c(1, 2, 3, 4),
x1 = c(12, 13, 14, 15),
x2 = c(5, 6, 7, 8),
x3 = c(33, 44, 55, 66),
x4 = c(123, 123, 123, 123))
dt2 <- data.table(id = c(1, 2, 3, 4),
x5 = c(555, 666, 777, 888))
setkey(dt1, id)
setkey(dt2, id)
dt2[dt1, on="id"]
> dt2[dt1, on="id"]
id x5 x1 x2 x3 x4
1: 1 555 12 5 33 123
2: 2 666 13 6 44 123
3: 3 777 14 7 55 123
4: 4 888 15 8 66 123
However, when merging my original data, R can't allocate memory anymore, even though the output of the merge would fit in RAM.
What is the most efficient (speed vs. memory limitations) way of getting this large merge done?
Should we split-apply-combine?
Should we use a DB library to get this done?
How would you do this efficiently?
Keyed assignment should save memory.
dt1[dt2, on = "id", x5 := x5]
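Applied to the toy tables from the question, the keyed assignment updates dt1 in place instead of materialising a new joined table:

```r
library(data.table)

# Toy tables from the question
dt1 <- data.table(id = c(1, 2, 3, 4),
                  x1 = c(12, 13, 14, 15),
                  x2 = c(5, 6, 7, 8),
                  x3 = c(33, 44, 55, 66),
                  x4 = c(123, 123, 123, 123))
dt2 <- data.table(id = c(1, 2, 3, 4),
                  x5 = c(555, 666, 777, 888))

# Adds column x5 to dt1 by reference; the x5 on the right-hand side is
# resolved from dt2, and no second copy of dt1's existing columns is made.
dt1[dt2, on = "id", x5 := x5]
dt1$x5
# [1] 555 666 777 888
```

The key point is that only the new column is allocated; the five existing columns of dt1 are never copied, which is what makes this cheaper than `dt1 <- dt2[dt1, on = "id"]`.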
Should we use a DB library to get this done?
That's probably a good idea. If setting up and using a database is painful for you, try the RSQLite
package. It's pretty simple.
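If the join itself is what exhausts memory, pushing it into SQLite lets the heavy lifting spill to disk. A minimal sketch using the toy tables; on the real data you would point dbConnect at a file (the name "merge.sqlite" below is illustrative) rather than ":memory:", which is used here only to keep the sketch self-contained:

```r
library(DBI)
library(RSQLite)
library(data.table)

# Toy tables from the question
dt1 <- data.table(id = 1:4, x1 = c(12, 13, 14, 15), x2 = c(5, 6, 7, 8),
                  x3 = c(33, 44, 55, 66), x4 = rep(123, 4))
dt2 <- data.table(id = 1:4, x5 = c(555, 666, 777, 888))

# For real data: con <- dbConnect(RSQLite::SQLite(), "merge.sqlite")
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "dt1", dt1)
dbWriteTable(con, "dt2", dt2)

# The LEFT JOIN runs inside SQLite; only the final result enters R's memory.
result <- dbGetQuery(con, "
  SELECT dt1.*, dt2.x5
  FROM dt1
  LEFT JOIN dt2 ON dt1.id = dt2.id
  ORDER BY dt1.id
")
dbDisconnect(con)
```

If even the final result is too big to pull back in one piece, dbSendQuery() plus dbFetch(res, n = ...) lets you read it in chunks.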
My experiment
tl;dr: 55% less memory used by keyed assignment compared to merge-and-replace, for a toy example.
I wrote two scripts that each sourced a setup script, dt-setup.R
to create dt1
and dt2
. The first script, dt-merge.R
, updated dt1
through the "merge" method. The second, dt-keyed-assign.R
, used keyed assignment. Both scripts recorded memory allocations using the Rprofmem()
function.
To avoid torturing my laptop, I made dt1 500,000 rows and dt2 3,000 rows.
Scripts:
# dt-setup.R
library(data.table)
set.seed(9474)
id_space <- seq_len(3000)
dt1 <- data.table(
id = sample(id_space, 500000, replace = TRUE),
x1 = runif(500000),
x2 = runif(500000),
x3 = runif(500000),
x4 = runif(500000)
)
dt2 <- data.table(
id = id_space,
x5 = 11 * id_space
)
setkey(dt1, id)
setkey(dt2, id)
# dt-merge.R
source("dt-setup.R")
Rprofmem(filename = "dt-merge.out")
dt1 <- dt2[dt1, on = "id"]
Rprofmem(NULL)
# dt-keyed-assign.R
source("dt-setup.R")
Rprofmem(filename = "dt-keyed-assign.out")
dt1[dt2, on = "id", x5 := x5]
Rprofmem(NULL)
With all three scripts in my working directory, I ran each of the joining scripts in a separate R process.
system2("Rscript", "dt-merge.R")
system2("Rscript", "dt-keyed-assign.R")
I think the lines in the output files generally follow the pattern "<bytes> :<call stack>". I haven't found good documentation for this. However, the numbers at the front were never below 128, which is the default minimum size in bytes below which R does not malloc vectors.
Note that not all of these allocations add to the total memory used by R. R might reuse some memory it already has after a garbage collection. So it's not a good way to measure how much memory is used at any specific time. However, if we assume garbage collection behavior is independent, it does work as a comparison between scripts.
Some sample lines of the memory report:
cat(readLines("dt-merge.out", 5), sep = "\n")
# 90208 :"get" "["
# 528448 :"get" "["
# 528448 :"get" "["
# 1072 :"get" "["
# 20608 :"get" "["
There are also lines like new page:"get" "["
for page allocations.
Luckily, these are simple to parse.
parse_memory_report <- function(path) {
  report <- readLines(path)
  new_pages <- startsWith(report, "new page:")
  allocations <- as.numeric(gsub(":.*", "", report[!new_pages]))
  total_malloced <- sum(allocations)
  message(
    "Summary of ", path, ":\n",
    sum(new_pages), " new pages allocated\n",
    total_malloced, " bytes malloced"
  )
}
parse_memory_report("dt-merge.out")
# Summary of dt-merge.out:
# 12 new pages allocated
# 32098912 bytes malloced
parse_memory_report("dt-keyed-assign.out")
# Summary of dt-keyed-assign.out:
# 13 new pages allocated
# 14284272 bytes malloced
I got exactly the same results when repeating the experiment.
So keyed assignment has one more page allocation. The default byte size for a page is 2000. I'm not sure how malloc
works, and 2000 is tiny relative to all the allocations, so I'll ignore this difference. Please chastise me if this is dumb.
So, ignoring pages, keyed assignment allocated 55% less memory than the merge.
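That 55% figure follows directly from the two reported totals:

```r
malloced_merge <- 32098912   # bytes malloced by dt-merge.R
malloced_keyed <- 14284272   # bytes malloced by dt-keyed-assign.R

# Fraction of the merge's allocations that keyed assignment avoided
round(1 - malloced_keyed / malloced_merge, 2)
# [1] 0.55
```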