What is the most efficient way to create a column of vectors in `data.table` when matching data from a second table?

Views: 66. Posted: 2017/3/12 11:45:40. Tags: r, data.table


Problem description

What is the most efficient way to create a column of vectors in a data.table when we need to match elements from a second data.table?

For example, given the two data.tables below:
   > A_ids.DT        > rec_data_table
      name id           bid counts names_list
   1:    A  1        1: 301     21        C,E
   2:    B  2        2: 302     21          E
   3:    C  3        3: 303      5      H,E,G
   4:    D  4        4: 304     10        H,D
   5:    F  6        5: 305      3          E
   6:    G  7        6: 306      5          G
   7:    H  8        7: 307      6        B,C
   8:    J 10        
   9:    K 11        
I would like to create a new column in rec_data_table where each element is a vector of the ids from A_ids.DT that correspond to the names in rec_data_table[, names_list].

IMPORTANT: The order of the names within each entry of names_list must be preserved in the new column, i.e., for row 3 (H, E, G) we should get c(8, NA, 7).
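To make the ordering requirement concrete, here is a minimal sketch of a single lookup for row 3. It assumes a reasonably recent data.table version in which a join with `on =` and a single column in `j` returns a plain vector; the variable name `ids` is only illustrative.

```r
library(data.table)

# Lookup table from the question.
A_ids.DT <- data.table(name = LETTERS[c(1:4, 6:8, 10:11)],
                       id   = c(1:4, 6:8, 10:11),
                       key  = "name")

# Join in the order the names are given; unmatched names ("E") yield NA.
ids <- A_ids.DT[.(name = c("H", "E", "G")), id, on = "name"]
print(ids)
# 8 NA 7
```

The result follows the order of the names passed in, not the order of A_ids.DT, which is the behaviour the new column needs for every row.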

The following line, which uses sapply, works, but I question its efficiency.

Are there better (i.e., quicker, more elegant) alternatives? (Note that the actual data has several hundred thousand rows.)
rec_data_table[, A_IDs.list := sapply(names_list, function(n) c(A_ids.DT[n, id]$id))]

   bid counts names_list A_IDs.list
1: 301     21        C,E       3,NA
2: 302     21          E         NA
3: 303      5      H,E,G     8,NA,7
4: 304     10        H,D        8,4
5: 305      3          E         NA
6: 306      5          G          7
7: 307      6        B,C        2,3
#--------------------------------------------------#
#           SAMPLE DATA                            #

library(data.table)
set.seed(101)

rows <- size <- 7
varyingLengths <- c(sample(1:3, rows, TRUE))
A <- lapply(varyingLengths, function(n) sample(LETTERS[1:8], n))
counts <- round(abs(rnorm(size)*12))
rec_data_table <- data.table(bid=300+(1:size), counts=counts, names_list=A, key="bid")

A_ids.DT <- data.table(name=LETTERS[c(1:4,6:8,10:11)], id=c(1:4,6:8,10:11), key="name")

Solution
Perhaps unpack the lists, then join the whole table, then repack?
tmp <- setkey(rec_data_table[, list(names = names_list[[1]],
                                    orig.order = seq_along(names_list[[1]])),
                             by = list(bid, counts)], names)
tmp <- A_ids.DT[tmp]
setkey(tmp, orig.order)
tmp <- tmp[, list(names_list = list(name), A_IDs.list = list(id)),
           by = list(bid, counts)]

# Rearrange to sample output order
setkey(tmp, bid)
setcolorder(tmp, c("bid", "counts", "names_list", "A_IDs.list"))


### Output ###
> tmp
#   bid counts names_list A_IDs.list
# 1: 301     21        C,E       3,NA
# 2: 302     21          E         NA
# 3: 303      5      H,E,G     8,NA,7
# 4: 304     10        H,D        8,4
# 5: 305      3          E         NA
# 6: 306      5          G          7
# 7: 307      6        B,C        2,3

> identical(tmp, rec_data_table[, A_IDs.list := sapply(names_list, function(n) c(A_ids.DT[n, id]$id))])
# [1] TRUE
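
As a possible alternative (a sketch only, not the method used in this answer), the same unpack/lookup/repack idea can be written with base R's `match()` and `split()` instead of a keyed join. The three-row `rec_data_table` below is illustrative data, and the helper names `flat_names`, `flat_ids`, `row_index` and `ids_by_row` are made up for this example.

```r
library(data.table)

# Lookup table and a small three-row version of rec_data_table (illustrative data).
A_ids.DT <- data.table(name = LETTERS[c(1:4, 6:8, 10:11)],
                       id   = c(1:4, 6:8, 10:11),
                       key  = "name")
rec_data_table <- data.table(bid = 301:303,
                             names_list = list(c("C", "E"), "E", c("H", "E", "G")))

# Flatten all names into one vector and look them up in a single match() call.
flat_names <- unlist(rec_data_table$names_list)
flat_ids   <- A_ids.DT$id[match(flat_names, A_ids.DT$name)]

# Record which row each flattened element came from, then split back per row.
# The explicit factor levels keep a slot for every row, even an empty one.
n_rows     <- nrow(rec_data_table)
row_index  <- rep(seq_len(n_rows), lengths(rec_data_table$names_list))
ids_by_row <- unname(split(flat_ids, factor(row_index, levels = seq_len(n_rows))))

# Wrap in list() so := stores the result as a single list column.
rec_data_table[, A_IDs.list := list(ids_by_row)]
```

Because `match()` is called once on the flattened vector and `split()` keeps elements in their original order within each row, the per-row ordering of names_list carries over to A_IDs.list.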


Timings

I increased the number of rows in rec_data_table to 1e5 and got the following timings. 

Method presented in question: 
> system.time(rec_data_table[, A_IDs.list := sapply(names_list, function(n) c(A_ids.DT[n, id]$id))])
   user  system elapsed 
 196.89    0.04  197.81 
Method presented here:
> system.time( {
+ tmp <- setkey(rec_data_ta .... [TRUNCATED] 
   user  system elapsed 
   0.95    0.00    0.95 
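
To try a comparable timing yourself, the sample-data code can be scaled up to 1e5 rows and the unpack/join/repack steps wrapped in `system.time()`. This is a sketch under assumptions: the names `big_table` and `res` are made up for this example, and your timings will depend on your machine and data.table version, so no expected numbers are shown.

```r
library(data.table)
set.seed(101)

# Same construction as the sample data, with 1e5 rows instead of 7.
size <- 1e5
varyingLengths <- sample(1:3, size, TRUE)
A <- lapply(varyingLengths, function(n) sample(LETTERS[1:8], n))
counts <- round(abs(rnorm(size) * 12))
big_table <- data.table(bid = 300 + seq_len(size), counts = counts,
                        names_list = A, key = "bid")
A_ids.DT <- data.table(name = LETTERS[c(1:4, 6:8, 10:11)],
                       id   = c(1:4, 6:8, 10:11), key = "name")

# Time the unpack / join / repack steps from the answer.
timing <- system.time({
  tmp <- setkey(big_table[, list(names = names_list[[1]],
                                 orig.order = seq_along(names_list[[1]])),
                          by = list(bid, counts)], names)
  tmp <- A_ids.DT[tmp]
  setkey(tmp, orig.order)
  res <- tmp[, list(names_list = list(name), A_IDs.list = list(id)),
             by = list(bid, counts)]
  setkey(res, bid)
})
print(timing)
```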


                        