Aggregate a data frame based on unordered pairs of columns
Question
I have a data set that looks something like this:
   id1  id2 size
1 5400 5505    7
2 5033 5458    1
3 5452 2873   24
4 5452 5213    2
5 5452 4242   26
6 4823 4823    4
7 5505 5400   11
Here id1 and id2 are unique nodes in a graph, and size is a value assigned to the directed edge from id1 to id2. The data set is fairly large (a little over 2 million rows). I would like to sum the size column, grouped by unordered node pairs of id1 and id2.

For example, the first row has id1=5400 and id2=5505, and another row in the data frame has id1=5505 and id2=5400. In the grouped data, the size values of these two rows should be summed into a single row. In other words, I want to summarize the data by grouping on the (unordered) set {id1, id2}.

I have found a way to do this using apply with a custom function that checks the full data set for the reversed column pair, but it is excruciatingly slow. Does anyone know a more efficient way to do this, perhaps with plyr or with something in the base packages?
Recommended answer
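One way is to take pmin and pmax of id1 and id2 and group on those two values, as follows. Because pmin always returns the smaller id and pmax the larger, both directions of an edge map to the same pair. I'll use a data.table solution here.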
library(data.table)
DT <- data.table(DF)
# Following mnel's suggestion, the pmin/pmax expressions are used directly in `by`
# and named `id1` and `id2`, so no helper columns (g1, g2) are needed
DT.OUT <- DT[, list(size = sum(size)),
             by = list(id1 = pmin(id1, id2), id2 = pmax(id1, id2))]
#     id1  id2 size
# 1: 5400 5505   18
# 2: 5033 5458    1
# 3: 2873 5452   24
# 4: 5213 5452    2
# 5: 4242 5452   26
# 6: 4823 4823    4
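Since the question also asks about the base packages, the same pmin/pmax idea can be sketched with aggregate(). This is only an illustration, not part of the original answer: the sample data is copied from the question, and the names lo, hi, and res are placeholders chosen here. It may well be slower than data.table on 2 million rows.

```r
# Sample data copied from the question
DF <- data.frame(
  id1  = c(5400, 5033, 5452, 5452, 5452, 4823, 5505),
  id2  = c(5505, 5458, 2873, 5213, 4242, 4823, 5400),
  size = c(7, 1, 24, 2, 26, 4, 11)
)

# Put each pair in a canonical order: smaller id in `lo`, larger id in `hi`
# (`lo` and `hi` are hypothetical helper column names)
DF$lo <- pmin(DF$id1, DF$id2)
DF$hi <- pmax(DF$id1, DF$id2)

# Sum size within each unordered pair
res <- aggregate(size ~ lo + hi, data = DF, FUN = sum)
print(res)
```

Both versions avoid the row-by-row search for a reversed pair: every row is assigned its group key with two vectorized calls, and the grouping itself is left to the aggregation routine.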