Summarizing count and conditional aggregate functions on the same factor
Question
The short version: I'm having problems summarizing count and aggregate functions with conditions on the same factor.
Let's say I have this data frame:
library(dplyr)
df = tbl_df(data.frame(
  company=c("Acme", "Meca", "Emca", "Acme", "Meca", "Emca"),
  year=c("2011", "2010", "2009", "2011", "2010", "2013"),
  product=c("Wrench", "Hammer", "Sonic Screwdriver", "Fairy Dust",
            "Kindness", "Helping Hand"),
  price=c("5.67", "7.12", "12.99", "10.99", NA, FALSE)))
which creates (in essence) this data frame:
    company year product           price
1   Acme    2011 Wrench            5.67
2   Meca    2010 Hammer            7.12
3   Emca    2009 Sonic Screwdriver 12.99
4   Acme    2011 Fairy Dust        10.99
5   Meca    2010 Kindness          NA
... ...     ...  ...               ...
n   Emca    2013 Helping Hand      FALSE
Let's say I want to df <- group_by(df, company, year, product) and then get the following info all in one collection (i.e. a data frame):
- Count of each price listing (including NA and FALSE)
- Count of each with the NA condition
- Average price, excluding NA and FALSE
- Max price
summarize(df, count = n()) #satisfies first item obviously
I'm having issues trying to get the others. I think I need to use pipe operators? If so, can anyone provide some guidance?
This is what I've tried. It is blatantly wrong, but I'm not sure where to go next:
summarize(df,
  total.count = n(),
  count = filter(df, is.na(price)),
  avg.price = filter(df, !is.na(price), price != FALSE),
  max.price = max(filter(df, !is.na(price), price != FALSE)))
And yes, I have reviewed the documentation and I'm sure the answers are there, but they might be too advanced for my understanding. Thanks in advance!
Recommended answer
Assuming that your original dataset is similar to the one you created (i.e. with NA as a character string), you could specify na.strings while reading the data using read.table. But I guess NAs would be detected automatically.
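A minimal sketch of the na.strings idea, assuming the data arrives as whitespace-separated text. The inline `raw` string and the shortened product names are hypothetical stand-ins for a real file, and treating "FALSE" as missing is an assumption that mirrors how the answer handles it:

```r
# Hypothetical inline data standing in for a file on disk.
raw <- "company year product price
Acme 2011 Wrench 5.67
Meca 2010 Kindness NA
Emca 2013 HelpingHand FALSE"

# na.strings lists the strings to read as missing values.
# "NA" is already the default; "FALSE" is added here as an assumption.
prices <- read.table(text = raw, header = TRUE,
                     na.strings = c("NA", "FALSE"))

# With both markers read as NA, price is parsed as a numeric column.
str(prices$price)
```

Reading the column this way avoids the later conversion step, at the cost of no longer being able to tell NA and FALSE entries apart.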
The price column is a factor, which needs to be converted to numeric class. Convert it with as.character first, because as.numeric applied directly to a factor returns the internal level codes rather than the values. When you then use as.numeric, all the non-numeric elements (i.e. "NA", FALSE) get coerced to NA with a warning.
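The difference between the two conversions can be seen on a small vector. This is an illustrative sketch; the `price` vector below is a hypothetical stand-in for the column in the question:

```r
# A factor holding numbers plus the two non-numeric markers.
price <- factor(c("5.67", "7.12", "NA", "FALSE"))

# as.numeric() on a factor returns the internal level codes, not the values.
codes <- as.numeric(price)

# Going through as.character() recovers the numbers. "NA" and "FALSE"
# cannot be parsed as numbers, so they become NA; R emits a coercion warning.
values <- as.numeric(as.character(price))

print(codes)
print(values)
```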
library(dplyr)
df %>%
  mutate(price=as.numeric(as.character(price))) %>%
  group_by(company, year, product) %>%
  summarise(total.count=n(),
            count=sum(is.na(price)),
            avg.price=mean(price, na.rm=TRUE),
            max.price=max(price, na.rm=TRUE))
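Note that after the conversion, sum(is.na(price)) counts the "NA" and FALSE entries together. If the two need to be counted separately, one possible variation is to flag them before converting. This is a sketch rather than part of the original answer: the column names is_na, is_false, price_num, na.count and false.count are made up for illustration, and the .groups argument assumes dplyr 1.0.0 or later:

```r
library(dplyr)

# Same data as in the answer, built as plain character columns.
df <- data.frame(
  company = c("Acme", "Meca", "Emca", "Acme", "Meca", "Emca"),
  year    = c("2011", "2010", "2009", "2011", "2010", "2013"),
  product = c("Wrench", "Hammer", "Sonic Screwdriver", "Fairy Dust",
              "Kindness", "Helping Hand"),
  price   = c("5.67", "7.12", "12.99", "10.99", "NA", "FALSE"),
  stringsAsFactors = FALSE
)

res <- df %>%
  mutate(is_na     = is.na(price) | price == "NA",       # real NA or the string "NA"
         is_false  = !is_na & price == "FALSE",          # the string "FALSE"
         price_num = suppressWarnings(as.numeric(price))) %>%
  group_by(company, year, product) %>%
  summarise(total.count = n(),
            na.count    = sum(is_na),
            false.count = sum(is_false),
            avg.price   = mean(price_num, na.rm = TRUE),
            .groups = "drop")

print(res)
```

Groups in which every price is missing get NaN for avg.price, since mean() has nothing left to average once the NA values are removed.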
Data
I am using the same dataset that was shown (except the ... row).
df = tbl_df(data.frame(
  company=c("Acme", "Meca", "Emca", "Acme", "Meca", "Emca"),
  year=c("2011", "2010", "2009", "2011", "2010", "2013"),
  product=c("Wrench", "Hammer", "Sonic Screwdriver", "Fairy Dust",
            "Kindness", "Helping Hand"),
  price=c("5.67", "7.12", "12.99", "10.99", "NA", FALSE)))