Matching data tables by five columns to change a value in another column


Question

I have Ubuntu 16.04 and run R in the terminal. I am working with big data tables: one has 75 million rows and 11 columns (dt1) and the other has 7 million rows and 7 columns (dt2). All values are numeric. Both tables have an 'id' column. I need to find all rows in the first table whose values in five columns match those of the same five columns in some row of the second table, and set the 'id' of those rows in the first table to the 'id' of the matching row in the second table. The compared columns have the same names in both tables; let us say they are V1, V2, V3, V4 and V5. I've converted the second data table to data frame format so that I can use its 'id' as an index. I tried the loop below on the first 1000 rows and it took 40 minutes.

for (i in 1:1000) {
    dt1[(V1==dt2[i,V1] & V2==dt2[i,V2] &
         V3==dt2[i,V3] & V4==dt2[i,V4] &
         V5==dt2[i,V5]), id:=i]
}

I'm going to parallelize it, but due to memory constraints I can use only 2 or 3 cores. Clearly that won't be sufficient. Is there a quick and efficient way to do this on my home computer? If I do it on AWS, what tricks are useful there? In particular, how many cores can I use simultaneously?

Recommended answer

In R it is generally preferable to avoid loops where possible, as they are usually much slower than the alternative vectorized solutions.

This operation can be done with a data.table join. Basically, when you run

dt1[dt2];

you are performing a right join between the two data.tables. The preset key columns of dt1 determine which columns to join on. If dt1 has no preset key, the operation fails. But you can specify the on argument to select the key columns manually, on the fly:

key <- paste0('V',1:5);
dt1[dt2,on=key];

(The alternative, of course, is to preset a key using either setkey() or setkeyv().)
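As a rough illustration of the keyed alternative, here is a minimal sketch on two small made-up tables. The names a, b and cols and all values are illustrative assumptions, not taken from the question. The sketch keys both tables, so the join does not depend on the column order of the table passed as i.

```r
library(data.table)

# Hypothetical toy tables; names and values are illustrative only.
a <- data.table(id = 1:4, V1 = c(1, 1, 2, 2), V2 = c(1, 2, 1, 2))
b <- data.table(id = 11:12, V1 = c(1, 2), V2 = c(2, 2))
cols <- c("V1", "V2")

# Ad hoc join: the join columns are named explicitly, no key required.
res_on <- a[b, on = cols]

# Keyed join: setkeyv() sorts each table by reference and records
# V1, V2 as its key; a[b] then matches the key of a against the key of b.
setkeyv(a, cols)
setkeyv(b, cols)
res_key <- a[b]

# Both results hold one row per row of b, with the id of b in i.id.
res_key
```

Note that setkeyv() physically reorders the rows of the table, which may matter on a 75-million-row table if the original row order is meaningful; the on argument leaves the row order alone.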

The above operation just returns a merged table containing data from both dt1 and dt2, which is not what you want. But we can use the j argument of the data.table indexing function together with the := in-place assignment syntax to assign the id column of dt2 to the id column of dt1. Because the two names conflict, we must use i.id to refer to the id column of dt2, while the unprefixed name id still refers to the id column of dt1. This is simply the mechanism data.table provides for disambiguating conflicting names. Hence, you're looking for:

dt1[dt2,on=key,id:=i.id];

Here's an example that uses only two key columns and just a few rows of data, for simplicity. I also generated the keys so that some rows do not match, to demonstrate that the operation leaves the ids of non-matching rows untouched.

library(data.table);
set.seed(1L);
dt1 <- data.table(id=1:12,expand.grid(V1=1:3,V2=1:4),blah1=rnorm(12L));
dt2 <- data.table(id=13:18,expand.grid(V1=1:2,V2=1:3),blah2=rnorm(6L));
dt1;
##     id V1 V2      blah1
##  1:  1  1  1 -0.6264538
##  2:  2  2  1  0.1836433
##  3:  3  3  1 -0.8356286
##  4:  4  1  2  1.5952808
##  5:  5  2  2  0.3295078
##  6:  6  3  2 -0.8204684
##  7:  7  1  3  0.4874291
##  8:  8  2  3  0.7383247
##  9:  9  3  3  0.5757814
## 10: 10  1  4 -0.3053884
## 11: 11  2  4  1.5117812
## 12: 12  3  4  0.3898432
dt2;
##    id V1 V2       blah2
## 1: 13  1  1 -0.62124058
## 2: 14  2  1 -2.21469989
## 3: 15  1  2  1.12493092
## 4: 16  2  2 -0.04493361
## 5: 17  1  3 -0.01619026
## 6: 18  2  3  0.94383621
key <- paste0('V',1:2);
dt1[dt2,on=key,id:=i.id];
dt1;
##     id V1 V2      blah1
##  1: 13  1  1 -0.6264538
##  2: 14  2  1  0.1836433
##  3:  3  3  1 -0.8356286
##  4: 15  1  2  1.5952808
##  5: 16  2  2  0.3295078
##  6:  6  3  2 -0.8204684
##  7: 17  1  3  0.4874291
##  8: 18  2  3  0.7383247
##  9:  9  3  3  0.5757814
## 10: 10  1  4 -0.3053884
## 11: 11  2  4  1.5117812
## 12: 12  3  4  0.3898432
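The example above assumes each key combination appears at most once in dt2. If a combination is repeated, several ids compete for the same rows of dt1; as far as I know the update join then keeps the value from the last matching row, so it may be worth checking beforehand. The following is a minimal sketch of such a check, using hypothetical tables named lookup and target with illustrative values:

```r
library(data.table)

# Hypothetical tables; lookup repeats the combination (V1 = 1, V2 = 1).
lookup <- data.table(id = 13:15, V1 = c(1, 1, 2), V2 = c(1, 1, 1))
target <- data.table(id = 1:3, V1 = c(1, 2, 3), V2 = c(1, 1, 1))
cols <- c("V1", "V2")

# anyDuplicated() returns 0 if every key combination is unique,
# otherwise the index of the first duplicated row.
first_dup <- anyDuplicated(lookup, by = cols)

# Keep the first row per key combination (which row to keep is a
# judgment call that depends on the data).
lookup_unique <- unique(lookup, by = cols)

# Inner join (nomatch = NULL drops unmatched rows) to count how many
# rows of target will receive a new id.
n_matched <- nrow(target[lookup_unique, on = cols, nomatch = NULL])

# The update join itself, as in the answer.
target[lookup_unique, on = cols, id := i.id]
```

Comparing n_matched against the number of rows in the large table gives a quick sanity check that the join columns line up as intended before the ids are overwritten in place.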
