Applying group_by and summarise on data while keeping all the columns' info

Question

I have a large dataset with 22000 rows and 25 columns. I am trying to group the dataset by one column and take the minimum of another column within each group. The problem is that the result contains only two columns: the grouping column and the column of minimum values. I need all the other columns for the rows that hold those minimum values. Here is a simple, reproducible example:

    library(dplyr)

    data <- data.frame(a = 1:10, b = c("a","a","a","b","b","c","c","d","d","d"), c = c(1.2, 2.2, 2.4, 1.7, 2.7, 3.1, 3.2, 4.2, 3.3, 2.2), d = c("small", "med", "larg", "larg", "larg", "med", "small", "small", "small", "med"))

    # Minimum of c within each group of b; every other column is dropped
    d <- data %>%
      group_by(b) %>%
      summarise(min_values = min(c))
    d
    #   b min_values
    # 1 a        1.2
    # 2 b        1.7
    # 3 c        3.1
    # 4 d        2.2

So I also need the information in columns a and d. However, since column c contains duplicated values, I cannot merge the result back onto the original data using the min_values column. Is there any way to keep the other columns' information when using the dplyr package?

I have found some explanations here, "dplyr: group_by, subset and summarise", and here, "Finding percentage in a sub-group using group_by and summarise", but neither of them addresses my problem.

Recommended answer

Here are two options using a) filter and b) slice from dplyr. In this case no group has a duplicated minimum in column c, so a) and b) give the same result. If there were duplicated minima, approach a) would return every row that holds the group minimum, while b) would return only one row per group (the first).

a)

> data %>% group_by(b) %>% filter(c == min(c))
#Source: local data frame [4 x 4]
#Groups: b
#
#   a b   c     d
#1  1 a 1.2 small
#2  4 b 1.7  larg
#3  6 c 3.1   med
#4 10 d 2.2   med

or similarly

> data %>% group_by(b) %>% filter(min_rank(c) == 1L)
#Source: local data frame [4 x 4]
#Groups: b
#
#   a b   c     d
#1  1 a 1.2 small
#2  4 b 1.7  larg
#3  6 c 3.1   med
#4 10 d 2.2   med
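
The tie behaviour of approach a) is easier to see on data that actually contains a duplicated minimum. The sketch below uses a small hypothetical data frame, `ties`, which is not part of the original question; group "a" has two rows sharing the minimum of c. Both filter variants keep both of those rows.

```r
library(dplyr)

# Hypothetical data (not from the question): group "a" has two rows
# tied at the minimum value of c.
ties <- data.frame(
  a = 1:5,
  b = c("a", "a", "a", "b", "b"),
  c = c(1.2, 1.2, 2.4, 1.7, 2.7)
)

# Approach a): every row equal to its group minimum is kept.
by_filter <- ties %>%
  group_by(b) %>%
  filter(c == min(c)) %>%
  ungroup()

# min_rank() assigns tied values the same rank, so it keeps the same rows.
by_rank <- ties %>%
  group_by(b) %>%
  filter(min_rank(c) == 1L) %>%
  ungroup()

by_filter$a  # rows 1 and 2 (both tied in group "a") and row 4
```

If only one row per group is wanted, this result would need a further tie-breaking step, which is what approach b) provides.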

b)

> data %>% group_by(b) %>% slice(which.min(c))
#Source: local data frame [4 x 4]
#Groups: b
#
#   a b   c     d
#1  1 a 1.2 small
#2  4 b 1.7  larg
#3  6 c 3.1   med
#4 10 d 2.2   med
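
To see how approach b) treats ties, the sketch below runs it on a small hypothetical data frame, `ties` (not from the original question), in which group "a" has two rows sharing the minimum. It also shows `slice_min()`, which is not part of the original answer and assumes dplyr 1.0.0 or later; its `with_ties` argument makes the choice explicit.

```r
library(dplyr)

# Hypothetical data (not from the question): group "a" has two rows
# tied at the minimum value of c.
ties <- data.frame(
  a = 1:5,
  b = c("a", "a", "a", "b", "b"),
  c = c(1.2, 1.2, 2.4, 1.7, 2.7)
)

# Approach b): which.min() returns the position of the first minimum,
# so exactly one row per group is kept.
by_slice <- ties %>%
  group_by(b) %>%
  slice(which.min(c)) %>%
  ungroup()

by_slice$a  # rows 1 and 4

# Assumption: dplyr >= 1.0.0. slice_min() states the tie handling directly.
one_per_group <- ties %>%
  group_by(b) %>%
  slice_min(c, n = 1, with_ties = FALSE) %>%
  ungroup()

all_ties <- ties %>%
  group_by(b) %>%
  slice_min(c, n = 1, with_ties = TRUE) %>%
  ungroup()
```

With `with_ties = FALSE` the result has one row per group, like `slice(which.min(c))`; with `with_ties = TRUE` it keeps every tied row, like the filter approach.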
