在data.tables和update data.table之间匹配唯一值和非唯一值 [英] Matching unique and non-unique values between data.tables and update data.table

查看:71
本文介绍了在data.tables和update data.table之间匹配唯一值和非唯一值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我一直在根据唯一值和非唯一值来处理两个大型数据表之间的匹配值(每个大约2500万条记录)。我昨天发布了一个问题,如何更新值在data.table使用另一个data.table 和答案让我不知道如何可以提高这种匹配的效率。在我的示例中, dt1 包含非唯一区域代码和类型,可以是1到10(以及一些空白列)。 dt2 包含相同的非唯一区号(尽管与 dt1 中的编号不同),类型从1到10和唯一ID( U_ID )。



我想要做的是找到<$ c $在 dt1 dt2中创建 AREA_CD TYPE / code>到 dt2 U_ID dt1 。问题是dt1和dt2没有相同数量的每个唯一组合的实例。例如 dt1 中出现21次 AREA_CD 'A1'和 TYPE code>,在 dt2 中只有20次。在这些实例中,最小行数(在这种情况下为20)将应用匹配操作,使 dt1 中的1行未修改(如果 dt2 有多于 dt1 的行,则会使用 dt1 中的行数)。



这是我的数据集的示例。我的实际数据集大约有25,000,000行,包含大约10,000个唯一区域和1到10个类型。

  require )

df1 <-data.frame(AREA_CD = c(rep(A1,205),rep(A2,145) (A4,100),rep(A5,300)),TYPE = rep(1:10),ALLOCATED = 0,U_ID = 0,ID_CD = c(1:1000))
df1 $ ID_CD< - interaction(ID,df1 $ ID_CD,sep =)

df2 <-data.frame(U_ID = c(1:1000),AREA_CD = c (A5,310)),TYPE = rep(A1,200),rep(A2,155) 10),ASSIGNED = 0)
df2 $ U_ID< - 互动(U,df2 $ U_ID,sep =)

dt1< - as.data.table df1)
dt2 < - as.data.table(df2)

输出I我正在寻找如下所示:

  for(o in 1:5){
Ao& (n)(n)(n)(n)(n(n),n(n),n(n) & TYPE == i,])),nrow(with(df2,df2 [AREA_CD == Ao& TYPE == i,])))
df1 [df1 $ AREA_CD == Ao& df1 $ TYPE == i,] [1:R.Num,U_ID]< - as.character(df2 [df2 $ AREA_CD == Ao& df2 $ TYPE == i,] [1:R.Num ,U_ID])
}}

我希望有意义。

解决方案

对于 AREA_CD TYPE ,OP希望按照它们出现在两个data.tables中的顺序一一匹配这些行。例如,第一行 AREA_CD ==A1&在 dt1 中的TYPE == 1 将匹配第一行与 AREA_CD ==A1 TYPE == 1 在 dt2 ,然后第二行,依此类推。



如果添加了每个组中的行索引 I (或运行计数),则可以在连接操作中完成此操作:

 #add row indices 
dt1 [,I:= seq_len(.N),by =。(AREA_CD,TYPE)]
dt2 [,I: seq_len(.N),by =。(AREA_CD,TYPE)]

#替代代码:与上面的结果相同但更简洁
dt1 [,I:= rowid(AREA_CD,TYPE) ]
dt2 [,I:= rowid(AREA_CD,TYPE)]


#右连接(使用dt1的所有行)
dt0 < [dt1,on =。(AREA_CD,TYPE,I)]

#显示一个组的结果
dt0 [AREA_CD ==A1& TYPE == 1,]
#U_ID AREA_CD TYPE ASSIGNED I ALLOCATED i.U_ID ID_CD
#1:U1 A1 1 0 1 0 0 ID1
#2:U11 A1 1 0 2 0 0 ID11
#3:U21 A1 1 0 3 0 0 ID21
#...
#19:U181 A1 1 0 19 0 0 ID181
#20:U191 A1 1 0 20 0 0 ID191
#21:NA A1 1 NA 21 0 0 ID201

最后一行在某些列中有 NA 。这是由于 dt1 dt2 中此组的行数不同。 dt1 有21行,而 dt2 只有20行。因此, dt1 的最后一行在 dt2 中没有匹配项。



或者,内连接只会返回 dt1 dt2 中具有匹配的行:

 #inner join 
dt0
#显示一个组的结果
dt0 [AREA_CD ==A1& TYPE == 1,]
#U_ID AREA_CD TYPE ASSIGNED I ALLOCATED i.U_ID ID_CD
#1:U1 A1 1 0 1 0 0 ID1
#2:U11 A1 1 0 2 0 0 ID11
#3:U21 A1 1 0 3 0 0 ID21
#...
#18:U171 A1 1 0 18 0 0 ID171
#19:U181 A1 1 0 19 0 0 ID181
#20:U191 A1 1 0 20 0 0 ID191



<$>

资料



<$> p $ p> dt1 AREA_CD = factor(c(rep(A1,205),rep(A2,145),rep ,250),rep(A4,100),rep(A5,300))),
TYPE = rep(1:10),
ALLOCATED = 0,
U_ID = 0,
ID_CD = factor(paste0(ID,1:1000)))
dt2 U_ID = factor 1:1000)),
AREA_CD = factor(c(rep(A1,200),rep(A2,155) ),rep(A5,310))),
TYPE = rep(1:10),
ASSIGNED = 0)

请注意,使用<$ c创建 ID_CD U_ID $ c> paste0()而不是 interaction(),结果是相当慢。


I have been working on matching values between two large data tables (around 25 million records each) based on unique and non-unique values. I posted a question yesterday on how to update values in a data.table using another data.table and the answers have made me wonder how I can improve the efficiency of this matching. In my example dt1 contains a non-unique area code and a type which can be from 1 to 10 (along with some blank columns). dt2 contains the same non-unique area codes (although not the same number as in dt1), types from 1 to 10 and a unique ID (U_ID).

What I would like to do is find the AREA_CD and TYPE rows in dt1 and dt2 that match and copy the U_ID for those rows from dt2 to dt1. The issue is that dt1 and dt2 do not have the same number of instances of each unique combination. For example AREA_CD 'A1' and TYPE '1' occur 21 times in dt1 and only 20 times in dt2. In these instances then the minimum number of rows (so 20 in this case) would have the match operation applied leaving 1 row in dt1 unmodified (if dt2 had more rows than dt1 then the number of rows in dt1 would be used).

Here is an example of my dataset. My actual datasets have around 25,000,000 rows and contain around 10,000 unique areas and types from 1 to 10.

require("data.table")

df1 <-data.frame(AREA_CD = c(rep("A1", 205), rep("A2", 145), rep("A3", 250), rep("A4", 100), rep("A5", 300)), TYPE = rep(1:10), ALLOCATED = 0, U_ID = 0, ID_CD = c(1:1000))
df1$ID_CD <- interaction( "ID", df1$ID_CD, sep = "")

df2 <-data.frame(U_ID = c(1:1000), AREA_CD = c(rep("A1", 200), rep("A2", 155), rep("A3", 245), rep("A4", 90), rep("A5", 310)), TYPE = rep(1:10), ASSIGNED = 0)
df2$U_ID <- interaction( "U", df2$U_ID, sep = "")

dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)

The output I am looking for would look something like this:

for(o in 1:5){
Ao <- paste("A",o,sep="")
for(i in 1:10){
R.Num <- min(nrow(with(df1, df1[AREA_CD == Ao & TYPE == i ,])), nrow(with(df2, df2[AREA_CD == Ao & TYPE == i ,])))
df1[df1$AREA_CD == Ao & df1$TYPE == i,][1:R.Num,"U_ID"]  <- as.character(df2[df2$AREA_CD == Ao & df2$TYPE == i,][1:R.Num,"U_ID"])
}}

I hope that makes sense.

解决方案

For each group of AREA_CD and TYPE, the OP wants to match the rows one to one in the order they appear in both data.tables. E.g., the first row with AREA_CD == "A1" & TYPE == 1 in dt1 shall match the first row with AREA_CD == "A1" & TYPE == 1 in dt2, then the second rows and so forth.

This can be done in a join operation if a row index I (or running count) within each group is added:

# add row indices
dt1[, I := seq_len(.N), by = .(AREA_CD, TYPE)]
dt2[, I := seq_len(.N), by = .(AREA_CD, TYPE)]

# alternative code: same result as above but more concise
dt1[, I := rowid(AREA_CD, TYPE)]
dt2[, I := rowid(AREA_CD, TYPE)]


# right join (all rows of dt1 are used)
dt0 <- dt2[dt1, on = .(AREA_CD, TYPE, I)]

# show result for one group
dt0[AREA_CD == "A1" & TYPE == 1, ]
#    U_ID AREA_CD TYPE ASSIGNED  I ALLOCATED i.U_ID ID_CD
# 1:   U1      A1    1        0  1         0      0   ID1
# 2:  U11      A1    1        0  2         0      0  ID11
# 3:  U21      A1    1        0  3         0      0  ID21
#...
#19: U181      A1    1        0 19         0      0 ID181
#20: U191      A1    1        0 20         0      0 ID191
#21:   NA      A1    1       NA 21         0      0 ID201

Note that the last row has a NA in some columns. This is due to the different number of rows for this group in dt1 and dt2. dt1 has 21 rows while dt2 has only 20 rows. So, the last row of dt1 has no match in dt2.

Alternatively, a inner join will only return rows with a match in both dt1 and dt2:

# inner join
dt0 <- dt2[dt1, on = .(AREA_CD, TYPE, I), nomatch = 0]

# show result for one group
dt0[AREA_CD == "A1" & TYPE == 1, ]
#    U_ID AREA_CD TYPE ASSIGNED  I ALLOCATED i.U_ID ID_CD
# 1:   U1      A1    1        0  1         0      0   ID1
# 2:  U11      A1    1        0  2         0      0  ID11
# 3:  U21      A1    1        0  3         0      0  ID21
#...
#18: U171      A1    1        0 18         0      0 ID171
#19: U181      A1    1        0 19         0      0 ID181
#20: U191      A1    1        0 20         0      0 ID191

Now, only 20 rows are returned for this group.


Data

dt1 <- data.table(
  AREA_CD = factor(c(rep("A1", 205), rep("A2", 145), rep("A3", 250), rep("A4", 100), rep("A5", 300))), 
  TYPE = rep(1:10), 
  ALLOCATED = 0, 
  U_ID = 0, 
  ID_CD = factor(paste0("ID", 1:1000)))
dt2 <- data.table(
  U_ID = factor(paste0("U", 1:1000)), 
  AREA_CD = factor(c(rep("A1", 200), rep("A2", 155), rep("A3", 245), rep("A4", 90), rep("A5", 310))), 
  TYPE = rep(1:10), 
  ASSIGNED = 0)

Note that ID_CD and U_ID are created using paste0() instead of interaction() which turned out to be rather slow.

这篇关于在data.tables和update data.table之间匹配唯一值和非唯一值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆