Efficiently merging large data.tables

Problem description

I have two fairly large data.table objects that I want to merge.

  • dt1 has 500,000,000 observations on 5 columns.
  • dt2 has 300,000 observations on 2 columns.

Both objects have the same key called id.

I want to left_join information from dt2 into dt1.

For example:

dt1  <- data.table(id = c(1, 2, 3, 4),
               x1 = c(12, 13, 14, 15),
               x2 = c(5, 6, 7, 8),
               x3 = c(33, 44, 55, 66),
               x4 = c(123, 123, 123, 123))

dt2 <- data.table(id = c(1, 2, 3, 4),
              x5 = c(555, 666, 777, 888))
setkey(dt1, id)
setkey(dt2, id)

dt2[dt1, on="id"] 

> dt2[dt1, on="id"]
   id  x5 x1 x2 x3  x4
1:  1 555 12  5 33 123
2:  2 666 13  6 44 123
3:  3 777 14  7 55 123
4:  4 888 15  8 66 123

However, when merging my original data, R can no longer allocate memory. Yet, the output of the merge fits in RAM.

What is the most efficient (speed vs. memory limitations) way of getting this large merge done?

Should we split-apply-combine?

Should we use a DB library to get this done?

How would you do this efficiently?

Solution

Keyed assignment should save memory.

dt1[dt2, on = "id", x5 := x5]
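
A small variant, not from the original answer: if a column named x5 ever existed in both tables, the i. prefix (standard data.table syntax) makes it explicit that the values come from dt2, the table passed as i:

# Same keyed assignment, but explicitly pulling x5 from dt2 (the "i" table).
dt1[dt2, on = "id", x5 := i.x5]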

Should we use a DB library to get this done?

That's probably a good idea. If setting up and using a database is painful for you, try the RSQLite package. It's pretty simple.
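
If you go that route, here is a minimal sketch of the same left join done through DBI/RSQLite. The file name, table names, and index are illustrative assumptions, not part of the original answer, and note that dbGetQuery still pulls the whole result back into RAM:

# Illustrative sketch only: do the join inside SQLite, then read the result back.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "merge.sqlite")
dbWriteTable(con, "dt1", dt1)
dbWriteTable(con, "dt2", dt2)

# An index on the join column lets SQLite find matches without scanning dt2 repeatedly.
dbExecute(con, "CREATE INDEX idx_dt2_id ON dt2 (id)")

# LEFT JOIN keeps every row of dt1, matching the desired left_join semantics.
res <- dbGetQuery(con, "
  SELECT dt1.*, dt2.x5
  FROM dt1
  LEFT JOIN dt2 ON dt1.id = dt2.id
")

dbDisconnect(con)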


My experiment

tl;dr: for a toy example, keyed assignment used 55% less memory than merge-and-replace.

I wrote two scripts that each sourced a setup script, dt-setup.R, to create dt1 and dt2. The first script, dt-merge.R, updated dt1 through the "merge" method. The second, dt-keyed-assign.R, used keyed assignment. Both scripts recorded memory allocations using the Rprofmem() function.

To avoid torturing my laptop, I made dt1 500,000 rows and dt2 3,000 rows.

Scripts:

# dt-setup.R
library(data.table)

set.seed(9474)
id_space <- seq_len(3000)
dt1  <- data.table(
  id = sample(id_space, 500000, replace = TRUE),
  x1 = runif(500000),
  x2 = runif(500000),
  x3 = runif(500000),
  x4 = runif(500000)
)
dt2 <- data.table(
  id = id_space,
  x5 = 11 * id_space
)
setkey(dt1, id)
setkey(dt2, id)

# dt-merge.R
source("dt-setup.R")
Rprofmem(filename = "dt-merge.out")
dt1 <- dt2[dt1, on = "id"]
Rprofmem(NULL)

# dt-keyed-assign.R
source("dt-setup.R")
Rprofmem(filename = "dt-keyed-assign.out")
dt1[dt2, on = "id", x5 := x5]
Rprofmem(NULL)

With all three scripts in my working directory, I ran each of the joining scripts in a separate R process.

system2("Rscript", "dt-merge.R")
system2("Rscript", "dt-keyed-assign.R")

I think the lines in the output files generally follow the pattern "<bytes> :<call stack>". I haven't found good documentation for this. However, the numbers at the front were never below 128, which is the default minimum number of bytes below which R does not malloc for vectors.

Note that not all of these allocations add to the total memory used by R. R might reuse some memory it already has after a garbage collection. So it's not a good way to measure how much memory is used at any specific time. However, if we assume garbage collection behavior is independent, it does work as a comparison between scripts.
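
(Not part of the original experiment, but if you want a peak-usage view instead, base R's gc() has a "max used" column that can be reset before and read after the operation:)

# Illustrative alternative: compare peak memory via gc() bookkeeping.
gc(reset = TRUE)               # reset the "max used" counters
dt1[dt2, on = "id", x5 := x5]  # the operation being measured
gc()                           # the "max used" column now reports the peak since the reset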

Some sample lines of the memory report:

cat(readLines("dt-merge.out", 5), sep = "\n")
# 90208 :"get" "[" 
# 528448 :"get" "[" 
# 528448 :"get" "[" 
# 1072 :"get" "[" 
# 20608 :"get" "["

There are also lines like new page:"get" "[" for page allocations.

Luckily, these are simple to parse.

parse_memory_report <- function(path) {
  report <- readLines(path)
  # Page allocations look like 'new page:"get" "["'; everything else is '<bytes> :<call stack>'.
  new_pages <- startsWith(report, "new page:")
  allocations <- as.numeric(gsub(":.*", "", report[!new_pages]))
  total_malloced <- sum(allocations)
  message(
    "Summary of ", path, ":\n",
    sum(new_pages), " new pages allocated\n",
    total_malloced, " bytes malloced"
  )
}

parse_memory_report("dt-merge.out")
# Summary of dt-merge.out:
# 12 new pages allocated
# 32098912 bytes malloced

parse_memory_report("dt-keyed-assign.out")
# Summary of dt-keyed-assign.out:
# 13 new pages allocated
# 14284272 bytes malloced

I got exactly the same results when repeating the experiment.

So keyed assignment has one more page allocation. The default byte size for a page is 2000. I'm not sure how malloc works, and 2000 is tiny relative to all the allocations, so I'll ignore this difference. Please chastise me if this is dumb.

So, ignoring pages, keyed assignment allocated 55% less memory than the merge.
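
For reference, that 55% figure follows directly from the two byte totals reported above:

# Fraction of memory saved, computed from the totals printed by parse_memory_report():
1 - 14284272 / 32098912
# [1] 0.554992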
