What is the most efficient way to create a column of vectors in `data.table` when matching data from a second table?


Problem description


What is the most efficient way to create a column of vectors in a data.table when we need to match elements from a second data.table?

For example, given the two data.tables below:

   > A_ids.DT        > rec_data_table
      name id           bid counts names_list
   1:    A  1        1: 301     21        C,E
   2:    B  2        2: 302     21          E
   3:    C  3        3: 303      5      H,E,G
   4:    D  4        4: 304     10        H,D
   5:    F  6        5: 305      3          E
   6:    G  7        6: 306      5          G
   7:    H  8        7: 307      6        B,C
   8:    J 10        
   9:    K 11        

I would like to create a new column in rec_data_table where each element is a list of the ids from A_ids.DT, as referenced in rec_data_table[, names_list].

IMPORTANT: The order represented in each entry of names_list must be reflected in the new column, i.e., for row 3 (H, E, G) we should get c(8, NA, 7).
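
To make the order requirement concrete, here is a minimal sketch using base R's `match()`, which returns positions in the order of its first argument and `NA` where nothing matches. The plain vectors `name` and `id` are stand-ins for the columns of A_ids.DT, used for illustration only; this shows the expected result for row 3, not necessarily the fastest approach.

```r
# Plain vectors standing in for the columns of A_ids.DT (illustration only)
name <- LETTERS[c(1:4, 6:8, 10:11)]
id   <- c(1:4, 6:8, 10:11)

# Row 3 of rec_data_table: names in their original order
row3 <- c("H", "E", "G")

# match() keeps the order of `row3`; "E" is not in `name`, so it yields NA
ids_row3 <- id[match(row3, name)]
print(ids_row3)  # 8 NA 7
```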

The following line, which uses sapply, works, but I question its efficiency.
Are there better (i.e., quicker, more elegant) alternatives? (Note that the actual data has several hundred thousand rows.)

rec_data_table[, A_IDs.list := sapply(names_list, function(n) c(A_ids.DT[n, id]$id))]

   bid counts names_list A_IDs.list
1: 301     21        C,E       3,NA
2: 302     21          E         NA
3: 303      5      H,E,G     8,NA,7
4: 304     10        H,D        8,4
5: 305      3          E         NA
6: 306      5          G          7
7: 307      6        B,C        2,3


#--------------------------------------------------#
#           SAMPLE DATA                            #

library(data.table)
set.seed(101)

  rows <- size <- 7
  varyingLengths <- c(sample(1:3, rows, TRUE))
  A <-  lapply(varyingLengths, function(n) sample(LETTERS[1:8], n))
  counts <- round(abs(rnorm(size)*12))   
rec_data_table <- data.table(bid=300+(1:size), counts=counts, names_list=A, key="bid")

A_ids.DT <- data.table(name=LETTERS[c(1:4,6:8,10:11)], id=c(1:4,6:8,10:11), key="name")

Solution

Perhaps unpack the lists, then join the whole table, then repack?

tmp <- setkey(rec_data_table[, list(names = names_list[[1]],
                                    orig.order = seq_along(names_list[[1]])),
                             by = list(bid, counts)], names)
tmp <- A_ids.DT[tmp]
setkey(tmp, orig.order)
tmp <- tmp[, list(names_list = list(name), A_IDs.list = list(id)),
           by = list(bid, counts)]

# Rearrange to sample output order
setkey(tmp, bid)
setcolorder(tmp, c("bid", "counts", "names_list", "A_IDs.list"))


### Output ###
> tmp
#   bid counts names_list A_IDs.list
# 1: 301     21        C,E       3,NA
# 2: 302     21          E         NA
# 3: 303      5      H,E,G     8,NA,7
# 4: 304     10        H,D        8,4
# 5: 305      3          E         NA
# 6: 306      5          G          7
# 7: 307      6        B,C        2,3

> identical(tmp, rec_data_table[, A_IDs.list := sapply(names_list, function(n) c(A_ids.DT[n, id]$id))])
# [1] TRUE
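
The same unpack / join / repack idea can also be sketched without keys, assuming a data.table version that supports the `on =` argument (1.9.6 or later). The three-row table `rec` and the intermediate name `long` are hypothetical, chosen only for this illustration; this is a variant of the approach above, not the code that produced the timings in this answer.

```r
library(data.table)

# Lookup table as in the sample data, left unkeyed here
A_ids.DT <- data.table(name = LETTERS[c(1:4, 6:8, 10:11)],
                       id   = c(1:4, 6:8, 10:11))

# Small hypothetical stand-in for rec_data_table (first three rows of the example)
rec <- data.table(bid = 301:303,
                  names_list = list(c("C", "E"), "E", c("H", "E", "G")))

# Unpack: one row per (bid, name), keeping the within-row order
long <- rec[, .(name = unlist(names_list)), by = bid]

# Join: an update join adds `id` by reference without reordering `long`;
# names missing from A_ids.DT (here "E") stay NA
long[A_ids.DT, on = "name", id := i.id]

# Repack: collapse back to one row per bid with list columns
out <- long[, .(names_list = list(name), A_IDs.list = list(id)), by = bid]
print(out)
```

Because the update join leaves the row order of `long` untouched, no separate `orig.order` column is needed in this variant.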

Timings

I increased the number of rows in rec_data_table to 1e5 and got the following timings.

Method presented in question:

> system.time(rec_data_table[, A_IDs.list := sapply(names_list, function(n) c(A_ids.DT[n, id]$id))])
   user  system elapsed 
 196.89    0.04  197.81 

Method presented here:

> system.time( {
+ tmp <- setkey(rec_data_ta .... [TRUNCATED] 
   user  system elapsed 
   0.95    0.00    0.95 
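
A comparison like this could be set up roughly as follows. The helper `make_rec()` is hypothetical (it just wraps the sample-data code from the question), and the timed block is an `on =`-based variant of unpack / join / repack rather than the exact truncated code above. Timings will differ by machine and data.table version, so no expected numbers are shown.

```r
library(data.table)
set.seed(101)

# Lookup table as in the sample data (unkeyed; `on =` is used below)
A_ids.DT <- data.table(name = LETTERS[c(1:4, 6:8, 10:11)],
                       id   = c(1:4, 6:8, 10:11))

# Hypothetical helper: build a table shaped like rec_data_table with `size` rows
make_rec <- function(size) {
  lens <- sample(1:3, size, TRUE)
  data.table(bid        = 300 + seq_len(size),
             counts     = round(abs(rnorm(size) * 12)),
             names_list = lapply(lens, function(n) sample(LETTERS[1:8], n)),
             key        = "bid")
}

big <- make_rec(1e5)

# Time one unpack / join / repack pass over the enlarged table
timing <- system.time({
  long <- big[, .(name = unlist(names_list)), by = bid]
  long[A_ids.DT, on = "name", id := i.id]
  res <- long[, .(A_IDs.list = list(id)), by = bid]
})
print(timing)
```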
