Summarize the self-join index while avoiding cartesian product in R data.table

Problem description

With a 2-column data.table, I'd like to summarize the pairwise relationships in column 1 by summing the number of shared elements in column 2. In other words, how many shared Y elements does each pairwise combination of X-values have?

For example, I can do this in a 2-step process, first doing a cartesian cross join, then summarizing it like so:

d = data.table(X=c(1,1,1,2,2,2,2,3,3,3,4,4), Y=c(1,2,3,1,2,3,4,1,5,6,4,5))
setkey(d, Y)
d2 = d[d, allow.cartesian=TRUE]
d2[, .N, by=c("X", "i.X")]
 #  X i.X N
 #1: 1   1 3
 #2: 2   1 3
 #3: 3   1 1
 #4: 1   2 3
 #5: 2   2 4
 #6: 3   2 1
 #7: 1   3 1
 #8: 2   3 1
 #9: 3   3 3
#10: 4   2 1
#11: 2   4 1
#12: 4   4 2
#13: 4   3 1
#14: 3   4 1

The second row of this result indicates that X=1 shares 3 Y-values with X=2, while X=3 shares only 1 Y-value with X=4.
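To make the meaning of N concrete, here is a minimal sketch that recomputes two of the counts directly with set intersection. The helper name `shared` is made up for illustration, and the check assumes each (X, Y) pair appears only once in the table, as it does in the example data.

```r
library(data.table)

d <- data.table(X = c(1,1,1,2,2,2,2,3,3,3,4,4),
                Y = c(1,2,3,1,2,3,4,1,5,6,4,5))

# Hypothetical helper: number of Y values that two X values have in common.
shared <- function(a, b) {
  length(intersect(d[X == a, Y], d[X == b, Y]))
}

n_1_2 <- shared(1, 2)  # X=1 has Y {1,2,3}, X=2 has Y {1,2,3,4}
n_3_4 <- shared(3, 4)  # X=3 has Y {1,5,6}, X=4 has Y {4,5}
print(c(n_1_2, n_3_4))
```

This only works pair by pair, so it is useful as a spot check rather than as a replacement for the join.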

Is there any way to do this while bypassing the cartesian join step, which leads to large inefficient tables? I want to do something like this on a table with millions of rows, and the cartesian join runs into the 2^31 vector size limit (in addition to becoming slow).

I'm imagining something like this:

d[d, list(X, length(Y)), by=c("X", "i.X")]

But this gives the error: i.X not found

I can do this in SQL with the code below -- but just can't figure out how to translate this into data.table syntax:

CREATE TABLE test (X integer, Y integer);
INSERT INTO test VALUES(1, 1);
INSERT INTO test VALUES(1, 2);
INSERT INTO test VALUES(1, 3);
INSERT INTO test VALUES(2, 1);
INSERT INTO test VALUES(2, 2);
INSERT INTO test VALUES(2, 3);
INSERT INTO test VALUES(2, 4);
INSERT INTO test VALUES(3, 1);
INSERT INTO test VALUES(3, 5);
INSERT INTO test VALUES(3, 6);
INSERT INTO test VALUES(4, 4);
INSERT INTO test VALUES(4, 5);

SELECT A.X, B.X, COUNT(A.Y) as N FROM test as A JOIN test as B WHERE A.Y==B.Y GROUP BY A.X, B.X;

The point is that the column I want to summarize is the same as the column I am joining on. This question is similar to these, but not exactly:

R Data.Table Join on Conditionals

How to self join a data.table on a condition

The key difference being that I want to summarize the index column, which seems impossible to do with by=.EACHI.
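As a rough illustration of that limitation, the sketch below runs the self-join with by=.EACHI on the example data. As far as I understand it, the grouping is then one group per row of `i`, so the result is indexed by the join column Y rather than by pairs of X values. Using `on = "Y"` instead of a key is my own choice here.

```r
library(data.table)

d <- data.table(X = c(1,1,1,2,2,2,2,3,3,3,4,4),
                Y = c(1,2,3,1,2,3,4,1,5,6,4,5))

# One output row per row of i: the number of rows in x that match its Y value.
per_row <- d[d, .N, by = .EACHI, on = "Y"]
print(per_row)
```

The counts add up to the size of the full cartesian result, but there is no (X, i.X) breakdown, which is what the question actually needs.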

Solution

If you can split your Y's into groups that don't have a large intersection of X's, you could do the computation by those groups first, resulting in a smaller intermediate table:

d[, grp := Y <= 3] # this particular split works best for OP data
# join and count within each group, then add up the per-group counts
d[, .SD[.SD, on = "Y", allow.cartesian = TRUE][, .N, by = .(X, i.X)], by = grp][,
    .(N = sum(N)), by = .(X, i.X)]

The intermediate table above has only 16 rows, as opposed to the 26 rows of the full self-join. Unfortunately, I can't think of an easy way to create such a grouping automatically.
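One simple, non-optimal way to get a grouping without hand-picking it is to cut the distinct Y values into fixed-size blocks and process one block at a time. The sketch below does that. `chunk_size` is an arbitrary illustrative parameter, and this blocking makes no attempt to find the split with the smallest intermediate table; it only bounds how large any single join can get.

```r
library(data.table)

d <- data.table(X = c(1,1,1,2,2,2,2,3,3,3,4,4),
                Y = c(1,2,3,1,2,3,4,1,5,6,4,5))

# Assumption: blocks of 3 distinct Y values each (tune for real data).
chunk_size <- 3L
d[, grp := (match(Y, sort(unique(Y))) - 1L) %/% chunk_size]

# Self-join and count within each block, so only one block's
# cartesian result exists in memory at a time.
partial <- rbindlist(lapply(split(d, by = "grp"), function(chunk) {
  chunk[chunk, on = "Y", allow.cartesian = TRUE][, .N, by = .(X, i.X)]
}))

# Add up the per-block counts for each pair of X values.
res <- partial[, .(N = sum(N)), by = .(X, i.X)]
print(res)
```

On the example data this happens to reproduce the `Y <= 3` split used above. Smaller blocks lower the peak memory of each join but leave more partial rows to sum at the end.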
