Summarize the self-join index while avoiding cartesian product in R data.table

Views: 153 Published: 2017/3/12 11:16:03 Tags: r data.table self-join cross-join


Problem description

With a 2-column data.table, I'd like to summarize the pairwise relationships in column 1 by summing the number of shared elements in column 2. In other words, how many shared Y elements does each pairwise combination of X-values have?

For example, I can do this in a 2-step process, first doing a cartesian cross join, then summarizing it like so:
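library(data.table)
d = data.table(X=c(1,1,1,2,2,2,2,3,3,3,4,4), Y=c(1,2,3,1,2,3,4,1,5,6,4,5))
setkey(d, Y)                       # the self-join below matches rows on Y
d2 = d[d, allow.cartesian=TRUE]    # materializes every pair of rows that share a Y
d2[, .N, by=c("X", "i.X")]         # count shared Y-values per (X, i.X) pair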
 #  X i.X N
 #1: 1   1 3
 #2: 2   1 3
 #3: 3   1 1
 #4: 1   2 3
 #5: 2   2 4
 #6: 3   2 1
 #7: 1   3 1
 #8: 2   3 1
 #9: 3   3 3
#10: 4   2 1
#11: 2   4 1
#12: 4   4 2
#13: 4   3 1
#14: 3   4 1
The second row of this result indicates that X=1 shares 3 Y-values with X=2, while X=3 shares only 1 Y-value with X=4.
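As a cross-check on how to read this table, the same counts can be obtained from an incidence matrix. This sketch is not part of the original question: it uses base R's table() and crossprod(), assumes each (X, Y) pair occurs at most once, and builds a dense Y-by-X matrix, so it is only suitable for small data such as this example.

```r
# Toy data from the question
X <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4)
Y <- c(1, 2, 3, 1, 2, 3, 4, 1, 5, 6, 4, 5)

# Rows are Y-values, columns are X-values; a cell is 1 when that (Y, X) pair occurs
incidence <- table(Y, X)

# t(incidence) %*% incidence: cell [a, b] counts the Y-values shared by X=a and X=b
shared <- crossprod(incidence)

shared["1", "2"]  # 3: Y-values shared by X=1 and X=2
shared["3", "4"]  # 1: Y-values shared by X=3 and X=4
```

Pairs with no shared Y-value, such as X=1 and X=4, appear here as zeros, whereas the data.table result simply omits them.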

Is there any way to do this while bypassing the cartesian join step, which leads to large inefficient tables? I want to do something like this on a table with millions of rows, and the cartesian join runs into the 2^31 vector size limit (in addition to becoming slow).
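The size of the intermediate table can be estimated without performing the join: a Y-value that occurs n times contributes n^2 rows to the self-join. A minimal sketch, using the toy data from the question (the name join_rows is illustrative only):

```r
library(data.table)

d <- data.table(X = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4),
                Y = c(1, 2, 3, 1, 2, 3, 4, 1, 5, 6, 4, 5))

# Count the occurrences of each Y, then sum the squares.
# as.numeric() avoids integer overflow on large tables.
join_rows <- d[, .N, by = Y][, sum(as.numeric(N)^2)]

join_rows              # 26 for the toy data
join_rows > 2^31 - 1   # TRUE would indicate the join exceeds the size limit
```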

I'm imagining something like this: 
d[d, list(X, length(Y)), by=c("X", "i.X")]
But this gives the error: i.X not found.

I can do this in SQL with the code below -- but just can't figure out how to translate this into data.table syntax:
CREATE TABLE test (X integer, Y integer);
INSERT INTO test VALUES(1, 1);
INSERT INTO test VALUES(1, 2);
INSERT INTO test VALUES(1, 3);
INSERT INTO test VALUES(2, 1);
INSERT INTO test VALUES(2, 2);
INSERT INTO test VALUES(2, 3);
INSERT INTO test VALUES(2, 4);
INSERT INTO test VALUES(3, 1);
INSERT INTO test VALUES(3, 5);
INSERT INTO test VALUES(3, 6);
INSERT INTO test VALUES(4, 4);
INSERT INTO test VALUES(4, 5);

SELECT A.X, B.X, COUNT(A.Y) AS N FROM test AS A JOIN test AS B ON A.Y = B.Y GROUP BY A.X, B.X;
The point is that the column I want to summarize is the same as the column I am joining on. This question is similar to the following ones, but not identical:

R Data.Table Join on Conditionals

How to self join a data.table on a condition

The key difference is that I want to summarize the index column, which seems impossible to do with by=.EACHI.

Solution
If you can split your Y's into groups that don't have a large intersection of X's, you could do the computation by those groups first, resulting in a smaller intermediate table:
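d[, grp := Y <= 3] # this particular split works best for OP data
# self-join within each group on Y, count pairs per group, then sum across groups
d[, .SD[.SD, on = "Y", allow.cartesian = TRUE][, .N, by = .(X, i.X)], by = grp][,
    .(N = sum(N)), by = .(X, i.X)]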
The intermediate table above has only 16 rows, as opposed to 26. Unfortunately I can't think of an easy way to create such grouping automatically.
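One simple, if naive, way to generate such a grouping is to cut the sorted distinct Y-values into contiguous blocks and process one block at a time. The sketch below is an assumption layered on top of the answer, not the answer's own method: n_chunks is a hypothetical tuning knob, and the blocks are chosen without regard to how the X's overlap, so the reduction in intermediate rows may be smaller than with a hand-picked split.

```r
library(data.table)

d <- data.table(X = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4),
                Y = c(1, 2, 3, 1, 2, 3, 4, 1, 5, 6, 4, 5))

n_chunks <- 2L  # hypothetical tuning knob, not from the original answer

# Assign each distinct Y-value to one of n_chunks contiguous blocks
d[, grp := (match(Y, sort(unique(Y))) - 1L) %/% ceiling(uniqueN(Y) / n_chunks)]

# Self-join each chunk separately, so only one chunk's join is held at a time
pieces <- lapply(split(d, by = "grp"), function(chunk) {
  chunk[chunk, on = "Y", allow.cartesian = TRUE][, .N, by = .(X, i.X)]
})

# Pairs that occur in more than one chunk are summed into a single row
res <- rbindlist(pieces)[, .(N = sum(N)), by = .(X, i.X)]
res
```

For the toy data, n_chunks = 2 reproduces the Y <= 3 split used above. Each chunk's join is discarded after it is summarized, so peak memory depends on the largest chunk, not on the full cartesian result.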
