Aggregate data frame while keeping original order, in a simple manner


Problem description

I'm having some trouble aggregating a data frame while keeping the groups in their original order (order based on first appearance in data frame). I've managed to get it right, but was hoping there is an easier way to go about it.

Here is a sample data set to work on:

set.seed(7)
sel.1 <- sample(1:5, 20, replace = TRUE)     # selection vector 1
sel.2 <- sample(1:5, 20, replace = TRUE)
add.1 <- sample(81:100)                      # additional vector 1
add.2 <- sample(81:100)
orig.df <- data.frame(sel.1, sel.2, add.1, add.2)

Some points to note: there are two selection columns that determine how the data is grouped. These are always the same two columns, and their names are known. I have only put two additional columns in this data, but there may be more. I have given the columns names starting with 'sel' and 'add' to make it easier to follow, but the actual data has different names (so while grep tricks are cool, they won't be useful here).

What I'm trying to do is aggregate the data frame into groups based on the 'sel' columns, and to sum together all the 'add' columns. This is simple enough using aggregate as follows:

# Get the names of all the additional columns
all.add <- names(orig.df)[!(names(orig.df)) %in% c("sel.1", "sel.2")]
aggr.df <- aggregate(orig.df[,all.add], 
                     by=list(sel.1 = orig.df$sel.1, sel.2 = orig.df$sel.2), sum)

The problem is that the result is ordered by the 'sel' columns; I want it ordered based on each group's first appearance in the original data.

Here are my best attempts at making this work:

## Attempt 1
# create indices for each row (x) and find the minimum index for each range
index.df <- aggregate(x = 1:nrow(orig.df),
                      by=list(sel.1 = orig.df$sel.1, sel.2 = orig.df$sel.2), min)
# Make sure the x vector (indices) are in the right range for aggr.df
index.order <- (1:nrow(index.df))[order(index.df$x)]
aggr.df[index.order,]

## Attempt 2
# get the unique groups. These are in the right order.
unique.sel <- unique(orig.df[,c("sel.1", "sel.2")])
# use sapply to effectively loop over data and sum additional columns.
sums <- t(sapply(1:nrow(unique.sel), function (x) {
    sapply(all.add, function (y) {
        sum(aggr.df[which(aggr.df$sel.1 == unique.sel$sel.1[x] & 
                          aggr.df$sel.2 == unique.sel$sel.2[x]), y])
        })
}))
data.frame(unique.sel, sums)
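
Both attempts come down to reordering the groups by first appearance. Here is a minimal base-R sketch of that idea on a small made-up data frame. The names df, g1, g2, v1, v2 and the helper key() are illustrative assumptions, not part of the data above. It builds one key per row and uses match() to put aggregate()'s result in the order in which the keys first occur:

```r
# Hypothetical toy data: two grouping columns, two columns to sum
df <- data.frame(g1 = c(2, 1, 2, 1, 3),
                 g2 = c("b", "a", "b", "a", "a"),
                 v1 = c(1, 2, 3, 4, 5),
                 v2 = c(10, 20, 30, 40, 50))
sel <- c("g1", "g2")            # grouping columns, known by name
add <- setdiff(names(df), sel)  # everything else gets summed

# aggregate() returns the groups sorted by the grouping columns
agg <- aggregate(df[add], by = df[sel], FUN = sum)

# key(): one string per row built from the grouping columns.
# Assumes the values paste identically in df and agg (integers, strings)
# and that the separator "\r" does not occur in the data.
key <- function(d) do.call(paste, c(d[sel], sep = "\r"))

# unique() keeps keys in order of first appearance; match() finds the
# corresponding rows of the sorted aggregate
res <- agg[match(unique(key(df)), key(agg)), ]
rownames(res) <- NULL
res
```

The design choice here is to let aggregate() do the summing and treat the ordering as a separate lookup step, so no second aggregate() call over row indices is needed.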

While these give me the right result, I was hoping that somebody could point out a simpler solution. It would be preferable if the solution works with the packages that come with the standard R installation.

I've looked at the documentation for aggregate and match, but I couldn't find an answer (I guess I was hoping for something like a "keep.original.order" parameter for aggregate).

Any help would be much appreciated!


Update: (in case anybody stumbles across this)

Here is the cleanest way that I could find after trying for a few more days:

unique(data.frame(sapply(names(orig.df), function(x){
    if(x %in% c("sel.1", "sel.2")) orig.df[,x] else
    ave(orig.df[,x], orig.df$sel.1, orig.df$sel.2, FUN=sum)},
simplify=FALSE)))
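
To see why this works, it may help to split it into its two steps: ave() replaces every value with its group's sum without changing the row order, and unique() then drops the repeated rows while keeping the first occurrence of each group. The sketch below shows this on a small made-up data frame. The names df, g1, g2, v1, v2 and summed are illustrative assumptions, not the data above:

```r
# Hypothetical toy data: two grouping columns, two columns to sum
df <- data.frame(g1 = c(2, 1, 2, 1, 3),
                 g2 = c("b", "a", "b", "a", "a"),
                 v1 = c(1, 2, 3, 4, 5),
                 v2 = c(10, 20, 30, 40, 50))
sel <- c("g1", "g2")
add <- setdiff(names(df), sel)

# Step 1: replace each value by the sum over its group; rows stay in place
summed <- df
summed[add] <- lapply(df[add], function(col) ave(col, df$g1, df$g2, FUN = sum))

# Step 2: rows of the same group are now identical, so unique() keeps
# one row per group, in order of first appearance
res <- unique(summed)
res
```

Note that unique() keeps the original row names of the retained rows (here 1, 2 and 5), so the row names show where each group first appeared.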

Solution

A bit tough to read, but it gives you what you want and I added some comments to clarify.

# Define the columns you want to combine into the grouping variable
# (if the real names share no pattern, names(orig.df) %in% c("sel.1", "sel.2") works too)
sel.col <- grepl("^sel", names(orig.df))
# Create the grouping variable
lev <- apply(orig.df[sel.col], 1, paste, collapse=" ")
# Split and sum up
data.frame(unique(orig.df[sel.col]),
           t(sapply(split(orig.df[!sel.col], factor(lev, levels=unique(lev))),
                    apply, 2, sum)))

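The output looks like this (for the sample data above; R 3.6.0 and later use a different default sampling algorithm, so set.seed(7) may give different numbers there):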

   sel.1 sel.2 add.1 add.2
1      5     4    96    84
2      2     2   175   176
3      1     5   384   366
5      2     5    95    89
6      4     1   174   192
7      2     4    82    87
8      5     3    91    98
10     3     2   189   178
11     1     4   170   183
14     1     1   100    91
17     3     3    81    82
19     5     5    83    88
20     2     3    90    96
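
As a possible alternative to the split-based approach, base R's rowsum() takes a reorder argument, and with reorder = FALSE it returns the groups in the order they are encountered. The following is only a sketch on a small made-up data frame. The names df, g1, g2, v1, v2 and key are illustrative assumptions, and it assumes all columns to be summed are numeric:

```r
# Hypothetical toy data: two grouping columns, two numeric columns to sum
df <- data.frame(g1 = c(2, 1, 2, 1, 3),
                 g2 = c("b", "a", "b", "a", "a"),
                 v1 = c(1, 2, 3, 4, 5),
                 v2 = c(10, 20, 30, 40, 50))
sel <- c("g1", "g2")
add <- setdiff(names(df), sel)

# One key per row; assumes "\r" does not occur in the grouping values
key <- do.call(paste, c(df[sel], sep = "\r"))

# Column sums per group, groups kept in order of first appearance
sums <- rowsum(df[add], group = key, reorder = FALSE)

# unique() on the grouping columns yields the same first-appearance order,
# so the two pieces line up row by row
res <- cbind(unique(df[sel]), sums)
rownames(res) <- NULL
res
```

This only covers sums. For other summary functions, the split() or ave() approaches shown earlier are more general.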
