如何从一个data.table中选择行以应用于另一个data.table? [英] How to select rows from one data.table to apply in another data.table?

查看:104
本文介绍了如何从一个data.table中选择行以应用于另一个data.table?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个data.tables df (21个MIO行)和 tmp (500k行)



df 有三列将原始专利( origpat )参考专利( refpat ),并将原始分类( mainprim )绑定到 origpat



显示下面的30条第一行。每个 origpat refpat code> origpat 出现在1和300次之间,每个 refpat 介于1和3,100次之间

  dput(df [1:30,-3])
结构(list(origpat = c(4247592,4247592,4247592,4247592,
4247592,4247592,4247592,4247592,4247592,4247592,4247592,4247592,4247592,
4247592,4247592,4247592,4247592,4247592,4247592,4247592,
4247592,4247592,4247592,4247592,4247592,4247592,4247592,
4247592,4247592,4247592,4247592,4247592),ref.pat = c(4318978,
4436368,4358181,4478622,4312654,4293439,4286061,4363648,
4406517,4478623,4277285 ,4375743,4470520,4328022,4248614,
4297139,4296607,4296608,4395271,4321141,42494190,4431420,
4322467,4285730,4393138,4246034,4251278,439394,4277322,
4290586 ),mainprim = c(442,442,442,442,442,442,
442,442,442,442 442,442,442,442,442,
442,442,442,442,442,442 ,442,442,
442,442,442,442,442,442)),.Names = c(origpat,
ref.pat,mainprim),row.names = c(NA,30L),class = c(data.table,
data.frame))

tmp 包含专利列表 pnum 及其各自的主要分类 prim df 中的所有 origpat refpat $ c> pnum (专利号)在 tmp 中。
作为示例数据,我选择 tmp data.table,其中包含与所选 df dput

$ c> c>变量
$ b

 结构(list(pnum = c(4318978,4318978,4318978,4318978,
4318978,4318978,4318978,4318978,4436368,4436368 ,
4436​​368,4436​​368,4358181,4358181,4358181,4358181,
4478622,4312654,4312654,4312654 4312654,4312654,
4312654,4293439,4293439,4293439,4293439,4293439,
4293439,4293439,4293439 ,4293439,4293439,4293439,
4293439,4293439,4286061,4286061,4286061,4286061,
4286061 「4286061」,「4286061」,「4286061」,「4363648」,「4363648」,
「4363648」,「4406517」,「4478623」,「4478623」,「4277285」,「4375743」 $ b4375743,4375743,4375743,4470520,4470520,4470520,
4328022,4328022,4248614,4248614,4248614 4248614,
4248614,4248614,4297139,4297139,4297139,4297139,
4297139,4296607,4296607,4296607 ,4296608,4296608,
4395271,4395271,4296607,4296607,
4296607,4296608 4395271,4321141,4321141,4321141,
4321141,4294190,4294190,4294190,4294190,4294190,
4294190 ,4431420,4431420,4431420,4431420,4431420,
4431420,4322467,4322467,4322467,4322467,4322467
4322467,4322467,4322467,4322467,4322467,4285730,
4285730,4393138,4393138,4393138,4393138 ,4393138,
4393138,4393138,4246034,4246034,4246034,4246034,
4251278,4251278,4251278 4339174,4339174,4339174,
4339174,4277322,4277322,4290586,4290586,4290586,
4290586,4290586 4247592,4247592,4247592,4247592,4247592,4247592,
4247592,4247592 ,
430,430,430,430,430,430,430,340,385 385,385,385,65,65,65,118,427,65,
65 ,516,516,516,511,106,106,106,501,501,501 516,516,435,435,435,
435,435,435,435 ,65,385,
65,65,501,422,53,53,53,222, 604,
65,65,385,385,65,65,65,65,106,106 b501,501,501,252,423,423,501,505,62,
423,501 501,505,62,65,65,65,210,210,
,433,433,501,156,118,118,118,118, 427,427,428,
428,428,428,428,428,428,501 ,
426,426,435,435,435,435,428,501,501 501,501,65,385,385,385,65,204,
204,204,266 ,432,73,427,427,428,
442,442,442,442,8,8 ,.Names = c(pnum,prim
),class = c(data.table,data.frame),row.names = c(NA,-147L
),.internal.selfref =< pointer:0x0000000000100788>)

mainprim (链接到 origpat )与 prim 变量链接到 refpat



下面的代码可以工作,但是速度太慢了。

  library(data.table)
df< - data.table(df); setfull(df,refpat,origpat)
refs - 唯一的(df $ refpat)#捕获df中的所有唯一的refpat(整个data.table中的71,000)
startrow< b $ b overlap< - function(a,b)sum(a == b)/ length(b)
df $ compare < - NA#重叠值将被插入这里

for(h in 1:length(refs)){
refclass <-tmp $ prim [tmp $ pnum == refs [h]] #subgroup of related'prim'
x < (df $ refpat [df $ refpat == refs [h]])
prims < - df $ mainprim [startrow:(startrow + x)]从大的`df` data.table隔离子集以减少内存需要在第二循环中
for(i in 1:x){
df $ compare [startrow + i] < - overlap(prims [i],refclass)
}
startrow< - startrow + x
print(h)
}

原因我使用两个 for 循环是为了节省计算机内存。我可以使用一个单一的,并重新确定 refclass 为每一行,但这使我的电脑在几分钟内崩溃。这个循环工作,但在大约250小时内完成的速度。
我确定有办法简单地将 tmp 中所需的行分组到 df 中,然后重复这对于每个 origpat 但我的data.table技能不能完成任务,我没有找到 解释 如何使这项工作在SO或data.table pdf文件。



任何建议都非常欢迎



EDIt @Frank我想做的具体比较总是在变化。主要问题如下。考虑一个具有两列链接 pnum (专利号)的列的 df ,一个名为 origpat ,第二个名为 ref.pat 。每个列包含多个重复的 pnum ,但每个组合(在单个行上)是唯一的。它建立了一个固定专利和一个旧专利之间的联系。此数据集大约有22个MIO行。
然后我有多个其他数据表,例如。一个将 pnum 链接到发明人,一个链接 pnum 到各种技术分类。我感兴趣的是找到以成对方式比较链接数据(例如,发明人,技术类)的最快方法,其中在 df (即 origpat ref.pat )。到目前为止,我所拥有的data.table解决方案是最快的,但它仍然需要多天才能完成一个新的比较。
希望这有助于

解决方案

我最好的想法是:


$ b $添加一个索引到数据表中按df
df [,compare:= sum(tmp [pnum == ref.pat,prim] == mainprim)/
length(tmp [pnum == ref.pat,prim]),by = idx]

或重复使用重叠函数(仍使用idx列):

  df [,compare:= overlap(
mainprim,
tmp [pnum == ref.pat,prim]),
by = idx]

它在这里是按行分组,然后使用Subset Data为此行获取 mainprim ,并需要 tmp 的子集。



如果你想避免创建 idx 列,你可以使用 by = 1:nrow(df)






@Docendo的很大改进:



您可以通过创建一个中间变量来存储子集,而不是每行执行子集两次:

  df [,compare:= {x = tmp [pnum == ref.pat,prim ]; sum(x == mainprim)/ length(x)},by = idx] 

df ref.pat mainprim 的重复组合>您可以使用 by = list(ref.pat,mainprim)而不是 by = idx

  df [,compare:= {x = tmp [pnum == ref.pat,prim]; sum(x == mainprim)/ length(x)},
by = list(ref.pat,mainprim)]

另一个可能只是最小的改进可以通过使用 mean()而不是 sum ()

  df [,compare:= mean(tmp [pnum == ref.pat ,prim] == mainprim),by = list(ref.pat,mainprim)] 


I have two data.tables df (21 MIO rows) and tmp (500k rows)

df has three columns linking an original patent (origpat) with a reference patent (refpat), and tying an original classification (mainprim) to the origpat.

Below the 30 first lines are shown. Every origpat, refpat pair is unique but every origpat appears between 1 and 300 times and every refpat between 1 and 3,100 times

dput(df[1:30,-3])
structure(list(origpat = c(4247592, 4247592, 4247592, 4247592, 
4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 
4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 
4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 
4247592, 4247592, 4247592, 4247592, 4247592), ref.pat = c(4318978, 
4436368, 4358181, 4478622, 4312654, 4293439, 4286061, 4363648, 
4406517, 4478623, 4277285, 4375743, 4470520, 4328022, 4248614, 
4297139, 4296607, 4296608, 4395271, 4321141, 4294190, 4431420, 
4322467, 4285730, 4393138, 4246034, 4251278, 4339174, 4277322, 
4290586), mainprim = c("442", "442", "442", "442", "442", "442", 
"442", "442", "442", "442", "442", "442", "442", "442", "442", 
"442", "442", "442", "442", "442", "442", "442", "442", "442", 
"442", "442", "442", "442", "442", "442")), .Names = c("origpat", 
"ref.pat", "mainprim"), row.names = c(NA, 30L), class = c("data.table", 
"data.frame"))

tmp contains a list of patents pnum with their respective primary classifications prim. All the origpat and refpat in df are pnum (patent numbers) in tmp. As example data I give the selection of tmp data.table that contain all the info relating to the selected df variables through dput

dput(tmp)
structure(list(pnum = c("4318978", "4318978", "4318978", "4318978", 
"4318978", "4318978", "4318978", "4318978", "4436368", "4436368", 
"4436368", "4436368", "4358181", "4358181", "4358181", "4358181", 
"4478622", "4312654", "4312654", "4312654", "4312654", "4312654", 
"4312654", "4293439", "4293439", "4293439", "4293439", "4293439", 
"4293439", "4293439", "4293439", "4293439", "4293439", "4293439", 
"4293439", "4293439", "4286061", "4286061", "4286061", "4286061", 
"4286061", "4286061", "4286061", "4286061", "4363648", "4363648", 
"4363648", "4406517", "4478623", "4478623", "4277285", "4375743", 
"4375743", "4375743", "4375743", "4470520", "4470520", "4470520", 
"4328022", "4328022", "4248614", "4248614", "4248614", "4248614", 
"4248614", "4248614", "4297139", "4297139", "4297139", "4297139", 
"4297139", "4296607", "4296607", "4296607", "4296607", "4296607", 
"4296607", "4296608", "4296608", "4296608", "4296608", "4296608", 
"4395271", "4395271", "4395271", "4321141", "4321141", "4321141", 
"4321141", "4294190", "4294190", "4294190", "4294190", "4294190", 
"4294190", "4431420", "4431420", "4431420", "4431420", "4431420", 
"4431420", "4322467", "4322467", "4322467", "4322467", "4322467", 
"4322467", "4322467", "4322467", "4322467", "4322467", "4285730", 
"4285730", "4393138", "4393138", "4393138", "4393138", "4393138", 
"4393138", "4393138", "4246034", "4246034", "4246034", "4246034", 
"4251278", "4251278", "4251278", "4339174", "4339174", "4339174", 
"4339174", "4277322", "4277322", "4290586", "4290586", "4290586", 
"4290586", "4290586", "4247592", "4247592", "4247592", "4247592", 
"4247592", "4247592", "4247592", "4247592", "4247592"), prim = c("430", 
"430", "430", "430", "430", "430", "430", "430", "340", "385", 
"385", "385", "385", "385", "65", "65", "65", "118", "427", "65", 
"65", "65", "65", "106", "106", "106", "501", "501", "501", "501", 
"501", "516", "516", "516", "516", "516", "435", "435", "435", 
"435", "435", "435", "435", "435", "156", "428", "65", "385", 
"65", "65", "501", "422", "53", "53", "53", "222", "422", "604", 
"65", "65", "385", "385", "65", "65", "65", "65", "106", "106", 
"501", "501", "501", "252", "423", "423", "501", "505", "62", 
"423", "501", "501", "505", "62", "65", "65", "65", "210", "210", 
"210", "435", "118", "118", "118", "118", "118", "118", "106", 
"433", "433", "433", "433", "501", "156", "427", "427", "428", 
"428", "428", "428", "428", "428", "428", "501", "501", "426", 
"426", "426", "435", "435", "435", "435", "428", "501", "501", 
"501", "501", "501", "65", "385", "385", "385", "65", "204", 
"204", "204", "266", "266", "432", "73", "427", "427", "428", 
"442", "442", "442", "442", "8", "8")), .Names = c("pnum", "prim"
), class = c("data.table", "data.frame"), row.names = c(NA, -147L
), .internal.selfref = <pointer: 0x0000000000100788>)

Now, I want to compare the mainprim (which is linked to the origpat) with the different prim variables linked to refpat.

Code below works but is prohibitively slow.

library(data.table)
df <- data.table(df) ; setkey(df, refpat, origpat) 
refs <- unique(df$refpat) # Capture all unique refpat in df (71,000 in entire data.table)
startrow <- 0 # Set loop    
overlap <- function(a,b) sum (a == b) / length(b)
df$compare <- NA # overlap values will be inserted here

for (h in 1:length(refs)) {
  refclass <- tmp$prim[tmp$pnum == refs[h]] #subgroup of relevant 'prim'
    x <- length(df$refpat[df$refpat == refs[h]])
    prims <- df$mainprim[startrow:(startrow + x)] # isolate subset from large `df` data.table to reduce memory needed in second loop
      for (i in 1:x) {
      df$compare[startrow + i] <- overlap(prims[i], refclass) 
        }
    startrow <- startrow + x
    print(h)
  }

The reason I use two for loops is to save computer memory. I could use a single one and redetermine refclass for every row but that made my computer crash within minutes. This loop works but at a speed of being done in about 250 hours. I'm sure there are ways to simply subset the needed rows from tmp within df and then repeat this for every origpat but my data.table skills are not up to the task and I don't find an answer that explains how to make this work on SO or on the data.table pdf files.

Any suggestions are very welcome

EDIt @Frank The specific comparison I want to make is always changing. The main issue is the following. Consider a long df with two columns of linked pnum (patent numbers), one called origpat and the second called ref.pat. Every column contains multiple repeated pnum but every combination (on a single row) is unique. It establishes a link between a firm patent and an older patent. This dataset is about 22 MIO rows. Then I have multiple other datatables, e.g. one linking the pnum to the inventors, one linking the pnum to various technology classifications. What I am interested in is finding the fastest way to compare the linked data (e.g. inventors, technology classes) on a pairwise basis, with the pair defined in the rows of the df (i.e. origpat and ref.pat). So far the data.table solution I have is the fastest but it still takes multiple days to complete a single new comparison. Hope this helps

解决方案

Best idea I came with is:

df[,idx := .I] # Add an index to the data.table to group by row of df
df[,compare := sum(tmp[pnum == ref.pat, prim] == mainprim) /
     length(tmp[pnum == ref.pat,prim]),by = idx]

Or reusing your overlap function (still using the idx column):

df[,compare := overlap(
                mainprim,
                tmp[pnum == ref.pat, prim]),
    by=idx]

What it does here is grouping by row and then use columns from Subset Data to get the mainprim for this row and the subsets of tmp needed.

If you want to avoid creating the idx column you can use by=1:nrow(df) instead but this could slow down the process (using an actual column is quicker in data.table).


Great improvements by @Docendo:

You can further speed up the process by creating an intermediate variable to store the subset instead of doing the subset twice per row:

df[,compare := {x = tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)}, by = idx]

And in case there are duplicated combinations of ref.pat and mainprim in df you could further optimize the performance by using by = list(ref.pat, mainprim) instead of by = idx:

df[,compare := {x = tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)},
   by = list(ref.pat, mainprim)]

And another, probably just minimal, improvement could be done by using mean() instead of sum()/length():

df[,compare := mean(tmp[pnum == ref.pat, prim] == mainprim), by = list(ref.pat, mainprim)]

这篇关于如何从一个data.table中选择行以应用于另一个data.table?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆