Aggregate a data frame based on unordered pairs of columns

Question

I have a data set that looks something like this:

     id1  id2   size
1   5400 5505      7
2   5033 5458      1
3   5452 2873     24
4   5452 5213      2
5   5452 4242     26
6   4823 4823      4
7   5505 5400     11

Here id1 and id2 are unique nodes in a graph, and size is a value assigned to the directed edge from id1 to id2. The data set is fairly large (a little over 2 million rows). I would like to sum the size column, grouped by unordered node pairs of id1 and id2. For example, the first row has id1=5400 and id2=5505, and another row in the data frame has id1=5505 and id2=5400. In the grouped data, the size values of these two rows would be summed into a single row. In other words, I want to summarize the data by grouping on the (unordered) set {id1, id2}. I have found a way to do this using apply with a custom function that checks the full data set for the reversed column pair, but it is excruciatingly slow. Does anyone know a more efficient way to do this, perhaps with plyr or with something in the base packages?

Recommended answer

One way is to build the grouping key from pmin and pmax of id1 and id2, as follows. The smaller id always comes first in the key, so (a, b) and (b, a) fall into the same group. I'll use a data.table solution here.

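library(data.table)

# Sample data from the question
DF <- data.frame(id1  = c(5400, 5033, 5452, 5452, 5452, 4823, 5505),
                 id2  = c(5505, 5458, 2873, 5213, 4242, 4823, 5400),
                 size = c(7, 1, 24, 2, 26, 4, 11))
DT <- data.table(DF)

# Following mnel's suggestion, the pmin/pmax expressions go directly in `by`,
# named `id1` and `id2`, so no helper columns (g1, g2) are needed.
DT.OUT <- DT[, list(size = sum(size)),
             by = list(id1 = pmin(id1, id2), id2 = pmax(id1, id2))]
# Within each row of the result the smaller id now comes first:
#     id1  id2 size
# 1: 5400 5505   18
# 2: 5033 5458    1
# 3: 2873 5452   24
# 4: 5213 5452    2
# 5: 4242 5452   26
# 6: 4823 4823    4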
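
Since the question also asks about the base packages, here is a minimal sketch of the same pmin/pmax idea using only base R's aggregate(). It is an illustration rather than part of the original answer: the helper column names lo and hi are made up for this example, and on 2 million rows it will probably be noticeably slower than the data.table version.

```r
# Sample data copied from the question
DF <- data.frame(id1  = c(5400, 5033, 5452, 5452, 5452, 4823, 5505),
                 id2  = c(5505, 5458, 2873, 5213, 4242, 4823, 5400),
                 size = c(7, 1, 24, 2, 26, 4, 11))

# Order-independent key: smaller id in `lo`, larger id in `hi`
# (`lo` and `hi` are hypothetical names chosen for this sketch)
DF$lo <- pmin(DF$id1, DF$id2)
DF$hi <- pmax(DF$id1, DF$id2)

# Sum size within each unordered pair
res <- aggregate(size ~ lo + hi, data = DF, FUN = sum)

# The 5400/5505 pair combines rows 1 and 7 (7 + 11)
res[res$lo == 5400 & res$hi == 5505, ]
```

Both pmin and pmax are vectorised, so building the key is a single pass over the columns with no per-row function calls, which is what made the apply-based approach slow.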
