Get dplyr count of distinct in a readable way

Problem description

I'm new to dplyr and I need to count the distinct values within each group. Here's an example table:

data=data.frame(aa=c(1,2,3,4,NA), bb=c('a', 'b', 'a', 'c', 'c'))

I know I can do things like:

by_bb<-group_by(data, bb, add = TRUE)
summarise(by_bb, mean(aa, na.rm=TRUE), max(aa), sum(!is.na(aa)), length(aa))

But if I want the count of unique elements?

I can do:

  > summarise(by_bb,length(unique(unlist(aa))))

  bb length(unique(unlist(aa)))
1  a                          2
2  b                          1
3  c                          2

And if I want to exclude NAs, I can do:

> summarise(by_bb,length(unique(unlist(aa[!is.na(aa)]))))

  bb length(unique(unlist(aa[!is.na(aa)])))
1  a                                      2
2  b                                      1
3  c                                      1

But that is hard for me to read. Is there a better way to do this kind of summarization?

Solution

How about this option:

data %>%                    # take the data.frame "data"
  filter(!is.na(aa)) %>%    # Using "data", filter out all rows with NAs in aa 
  group_by(bb) %>%          # Then, with the filtered data, group it by "bb"
  summarise(Unique_Elements = n_distinct(aa))   # Now summarise with unique elements per group

#Source: local data frame [3 x 2]
#
#  bb Unique_Elements
#1  a               2
#2  b               1
#3  c               1

Use filter to drop every row where aa is NA, then group the data by column bb, and finally summarise by counting the unique elements of column aa within each group of bb.
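If you would rather skip the separate filter step, n_distinct() also takes an na.rm argument. The following is only a sketch using the example data from the question; the column names with_na and without_na are made up for illustration.

```r
library(dplyr)

# Example data from the question
data <- data.frame(aa = c(1, 2, 3, 4, NA),
                   bb = c("a", "b", "a", "c", "c"))

# n_distinct() counts NA as a value unless na.rm = TRUE is set,
# so group "c" (values 4 and NA) gives 2 without it and 1 with it.
counts <- data %>%
  group_by(bb) %>%
  summarise(with_na    = n_distinct(aa),
            without_na = n_distinct(aa, na.rm = TRUE))

print(counts)
```

One difference to keep in mind: filter() removes the rows, so a group whose aa values are all NA disappears from the result, whereas with na.rm = TRUE that group stays and gets a count of 0.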

As you can see, I'm making use of the pipe operator %>%, which lets you "pipe" or "chain" commands together when using dplyr. This helps you write readable code because it's more natural: you write code from left to right and top to bottom, not deeply nested from the inside out (as in your example code).
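To make the comparison concrete, here is a small sketch that writes the same three steps once as nested calls and once as a pipeline. The variable names nested and piped are just for illustration.

```r
library(dplyr)

# Example data from the question
data <- data.frame(aa = c(1, 2, 3, 4, NA),
                   bb = c("a", "b", "a", "c", "c"))

# Nested form: has to be read from the innermost call outwards.
nested <- summarise(group_by(filter(data, !is.na(aa)), bb),
                    Unique_Elements = n_distinct(aa))

# Piped form: the same three steps, read from top to bottom.
piped <- data %>%
  filter(!is.na(aa)) %>%
  group_by(bb) %>%
  summarise(Unique_Elements = n_distinct(aa))

# Both spellings give the same result.
stopifnot(identical(as.data.frame(nested), as.data.frame(piped)))
```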

Edit:

In the first part of your question, you wrote:

I know I can do things like:

by_bb<-group_by(data, bb, add = TRUE)
summarise(by_bb, mean(aa, na.rm=TRUE), max(aa), sum(!is.na(aa)), length(aa))

Here's another option to do that (applying a number of functions to the same column(s)):

data %>%
  filter(!is.na(aa)) %>%
  group_by(bb) %>%
  summarise_each(funs(mean, max, sum, n_distinct), aa)

#Source: local data frame [3 x 5]
#
#  bb mean max sum n_distinct
#1  a    2   3   4          2
#2  b    2   2   2          1
#3  c    4   4   4          1
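Note that summarise_each() and funs() were deprecated in later dplyr releases. Assuming dplyr 1.0.0 or newer, a rough equivalent can be sketched with across(); the .names = "{.fn}" setting is my own choice here so that the columns keep the short names shown above (the default would prefix them with the column name, e.g. aa_mean).

```r
library(dplyr)

# Example data from the question
data <- data.frame(aa = c(1, 2, 3, 4, NA),
                   bb = c("a", "b", "a", "c", "c"))

# Apply several summary functions to column aa within each group of bb.
stats <- data %>%
  filter(!is.na(aa)) %>%
  group_by(bb) %>%
  summarise(across(aa,
                   list(mean = mean, max = max, sum = sum,
                        n_distinct = n_distinct),
                   .names = "{.fn}"))

print(stats)
```

To summarise more than one column, list them in the first argument of across(), for example across(c(aa, other_col), ...), where other_col is a hypothetical second column. In that case the default names are the safer choice, since "{.fn}" alone would produce duplicate column names.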
