Summarize the self-join index while avoiding cartesian product in R data.table
With a 2-column data.table, I'd like to summarize the pairwise relationships in column 1 by summing the number of shared elements in column 2. In other words, how many shared Y elements does each pairwise combination of X-values have?
For example, I can do this in a 2-step process, first doing a cartesian cross join, then summarizing it like so:
d = data.table(X=c(1,1,1,2,2,2,2,3,3,3,4,4), Y=c(1,2,3,1,2,3,4,1,5,6,4,5))
setkey(d, Y)
d2 = d[d, allow.cartesian=TRUE]
d2[, .N, by=c("X", "i.X")]
# X i.X N
#1: 1 1 3
#2: 2 1 3
#3: 3 1 1
#4: 1 2 3
#5: 2 2 4
#6: 3 2 1
#7: 1 3 1
#8: 2 3 1
#9: 3 3 3
#10: 4 2 1
#11: 2 4 1
#12: 4 4 2
#13: 4 3 1
#14: 3 4 1
The second row of this result indicates that X=1 shares 3 Y-values with X=2, while X=3 shares only 1 Y-value with X=4.
Is there any way to do this while bypassing the cartesian join step, which leads to large inefficient tables? I want to do something like this on a table with millions of rows, and the cartesian join runs into the 2^31 vector size limit (in addition to becoming slow).
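As an aside (not part of the original question), when the pairwise counts are all that is needed, the cartesian join can be avoided entirely by building a Y-by-X incidence matrix and taking its cross-product. A minimal sketch in base R, assuming unique (X, Y) pairs as in the example data:

```r
library(data.table)

d <- data.table(X = c(1,1,1,2,2,2,2,3,3,3,4,4),
                Y = c(1,2,3,1,2,3,4,1,5,6,4,5))

# 0/1 incidence matrix: rows are Y values, columns are X values
m <- table(d$Y, d$X)

# shared[i, j] = number of Y values shared by the i-th and j-th X value
shared <- crossprod(m)

shared["1", "2"]  # 3: X=1 and X=2 share Y values 1, 2, 3
shared["3", "4"]  # 1: X=3 and X=4 share only Y value 5
```

For millions of rows, the same idea carries over to a sparse representation (e.g. `Matrix::sparseMatrix`) so the incidence matrix never materializes densely.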
I'm imagining something like this:
d[d, list(X, length(Y)), by=c("X", "i.X")]
But this gives the error "i.X not found".
I can do this in SQL with the code below -- but just can't figure out how to translate this into data.table syntax:
CREATE TABLE test (X integer, Y integer);
INSERT INTO test VALUES(1, 1);
INSERT INTO test VALUES(1, 2);
INSERT INTO test VALUES(1, 3);
INSERT INTO test VALUES(2, 1);
INSERT INTO test VALUES(2, 2);
INSERT INTO test VALUES(2, 3);
INSERT INTO test VALUES(2, 4);
INSERT INTO test VALUES(3, 1);
INSERT INTO test VALUES(3, 5);
INSERT INTO test VALUES(3, 6);
INSERT INTO test VALUES(4, 4);
INSERT INTO test VALUES(4, 5);
SELECT A.X, B.X, COUNT(A.Y) as N FROM test as A JOIN test as B WHERE A.Y = B.Y GROUP BY A.X, B.X;
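For comparison (my sketch, not from the original post), that SQL corresponds almost line for line to a data.table join using the `on=` syntax. It still enumerates the matching row pairs, but selecting only the two X columns in `j` keeps the intermediate table as narrow as possible:

```r
library(data.table)

d <- data.table(X = c(1,1,1,2,2,2,2,3,3,3,4,4),
                Y = c(1,2,3,1,2,3,4,1,5,6,4,5))

# self-join on Y (A JOIN B WHERE A.Y = B.Y), keeping only the two X columns,
# then GROUP BY both X columns and count
res <- d[d, .(X, i.X), on = "Y", allow.cartesian = TRUE][, .N, by = .(X, i.X)]
```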
The point is that the column I want to summarize is the same as the column I am joining on. This question is similar to these, but not exactly:
R Data.Table Join on Conditionals
How to self join a data.table on a condition
The key difference being that I want to summarize the index column, which seems impossible to do with by=.EACHI.
If you can split your Y's into groups that don't have a large intersection of X's, you could do the computation by those groups first, resulting in a smaller intermediate table:
d[, grp := Y <= 3]  # this particular split works best for the OP's data
d[, .SD[.SD, allow.cartesian = TRUE][, .N, by = .(X, i.X)], by = grp][,
  .(N = sum(N)), by = .(X, i.X)]
The intermediate table above has only 16 rows, as opposed to 26. Unfortunately I can't think of an easy way to create such grouping automatically.
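Since the join only matches equal Y values, any partition of the distinct Y values is a valid grouping, which suggests a mechanical (if unoptimized) way to automate the split. A sketch assuming a simple modulo partition:

```r
library(data.table)

d <- data.table(X = c(1,1,1,2,2,2,2,3,3,3,4,4),
                Y = c(1,2,3,1,2,3,4,1,5,6,4,5))

nchunks <- 2              # tune so each per-group join fits in memory
d[, grp := Y %% nchunks]  # any partition of distinct Y values is valid

# join within each Y-group, count per (X, i.X), then sum across groups
res <- d[, .SD[.SD, on = "Y", allow.cartesian = TRUE][, .N, by = .(X, i.X)],
         by = grp][, .(N = sum(N)), by = .(X, i.X)]
```

This caps each intermediate join at the size of its largest Y-group rather than the full cartesian product, though a skewed partition (one very common Y value) would still blow up that group.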