dplyr: Find mean for each bin by groups
Problem description
I am trying to understand dplyr. I am splitting the values in my data frame by group, by bin, and by sign, and I want the mean value for each group/bin/sign combination. I would like to output a data frame with the counts for each group/bin/sign combination and the total count for each group. I think I have it, but sometimes I get different values in base R compared to the output of dplyr. Am I doing this correctly? It is also very contorted... is there a more direct way? Thank you!
library(ggplot2)
df <- data.frame(
  id = sample(LETTERS[1:3], 100, replace=TRUE),
  tobin = rnorm(1000),
  value = rnorm(1000)
)
df$tobin[sample(nrow(df), 10)]=0
df$bin = cut_interval(abs(df$tobin), length=1)
df$sign = ifelse(df$tobin==0, "NULL", ifelse(df$tobin>0, "-", "+"))
# Find mean of value by group, bin, and sign using dplyr
library(dplyr)
res <- df %>% group_by(id, bin, sign) %>%
  summarise(Num = length(bin), value=mean(value,na.rm=TRUE))
res %>% group_by(id) %>%
  summarise(total= sum(Num))
res=data.frame(res)
total=data.frame(total)
res$total = total[match(res$id, total$id),"total"]
res[res$id=="A" & res$bin=="[0,1]" & res$sign=="NULL",]
# Check in base R if mean by group, bin, and sign is correct # Sometimes not?
groupA = df[df$id=="A" & df$bin=="[0,1]" & df$sign=="NULL",]
mean(groupA$value, na.rm=T)
I am going crazy because it doesn't work on my data, and this command just repeats the mean of the whole dataset:
ddply(df, .(id, bin, sign), summarize, mean = mean(value,na.rm=TRUE))
Here mean is equal to mean(value,na.rm=TRUE) over the whole data set, completely ignoring the grouping. All the grouping variables are factors, and the value is numeric.
This, however, works:
with(df, aggregate(df$value, by = list(id, bin, sign), FUN = function(x) c(mean(x))))
Please help me.
You seem to be flailing a bit. You've got correct code, then you've got extra code.
Start from a fresh R session, define your data, and then run:
library(dplyr)
res <- df %>% group_by(id, bin, sign) %>%
  summarise(Num = n(), value = mean(value, na.rm=TRUE))
The above code is from your question, except that I replaced length(bin) with the built-in dplyr::n() function. It accurately gives the group-wise averages:
head(res)
# id bin sign Num value
# 1 A [0,1] - 122 -0.08330338
# 2 A [0,1] + 111 0.11394381
# 3 A [0,1] NULL 2 0.75232462
# 4 A (1,2] - 54 -0.09236725
# 5 A (1,2] + 45 0.20581095
# 6 A (2,3] - 12 -0.08998771
Jumping ahead to the last couple of lines in your code block:
groupA = df[df$id=="A" & df$bin=="[0,1]" & df$sign=="NULL", ]
mean(groupA$value, na.rm=T)
# [1] 0.7523246
This matches the third row of the results above. So you did it, and it works fine!
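One way to check every group at once, rather than a single group by hand, is to compute the same means with base R and compare the two results. This is only a minimal sketch, not part of the original code: it assumes dplyr 1.0 or later (for the .groups argument), uses a fixed seed, and swaps base::cut() with hand-picked breaks in for ggplot2::cut_interval() so that only dplyr needs to be loaded.

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed, only to make the sketch reproducible
df <- data.frame(
  id    = sample(LETTERS[1:3], 1000, replace = TRUE),
  tobin = rnorm(1000),
  value = rnorm(1000)
)
df$tobin[sample(nrow(df), 10)] <- 0
# Assumption: base::cut() with fixed breaks stands in for ggplot2::cut_interval()
df$bin  <- cut(abs(df$tobin), breaks = c(0:4, Inf), include.lowest = TRUE)
# Sign labels copied unchanged from the question
df$sign <- ifelse(df$tobin == 0, "NULL", ifelse(df$tobin > 0, "-", "+"))

# Grouped means with dplyr
res <- df %>%
  group_by(id, bin, sign) %>%
  summarise(Num = n(), value = mean(value, na.rm = TRUE), .groups = "drop")

# The same means with base R
base_res <- aggregate(value ~ id + bin + sign, data = df, FUN = mean)
names(base_res)[names(base_res) == "value"] <- "base_value"

# Line the two results up by group and compare them
cmp <- merge(as.data.frame(res), base_res, by = c("id", "bin", "sign"))
all.equal(cmp$value, cmp$base_value)
```

If the two approaches agree for every id/bin/sign combination, all.equal() returns TRUE. A mismatch on real data would more likely point to a package-masking problem than to group_by() itself.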
The rest of your code is confused:
res %>% group_by(id) %>%
  summarise(total = sum(Num))
I'm not sure what you're trying to accomplish with this, but you don't assign the result to anything, so it is run but not saved. The later total=data.frame(total) line therefore has no total object to work with.
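If the goal is a per-id total alongside the per-group counts, one more direct route is to regroup by id and add the total with mutate(), which avoids the separate total table and the match() step. The following is a hedged sketch on made-up data (the toy data frame is an assumption, and .groups requires dplyr 1.0 or later):

```r
library(dplyr)

set.seed(1)  # assumption: toy data, only for illustration
df <- data.frame(
  id    = sample(LETTERS[1:3], 200, replace = TRUE),
  sign  = sample(c("-", "+"), 200, replace = TRUE),
  value = rnorm(200)
)

res <- df %>%
  group_by(id, sign) %>%
  summarise(Num = n(), value = mean(value, na.rm = TRUE), .groups = "drop") %>%
  group_by(id) %>%
  mutate(total = sum(Num)) %>%   # per-id total, repeated on each row of that id
  ungroup()

# Alternative to the mutate() step: save the totals, then join them back
# total <- res %>% group_by(id) %>% summarise(total = sum(Num))
# res   <- left_join(res, total, by = "id")
```

Either way, the key point is that the result of a pipeline has to be assigned (or chained onward) to be available later.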
As for your ddply attempt:
ddply(df, .(id, bin, sign), summarize, mean = mean(value,na.rm=TRUE))
You'll notice that if you have dplyr loaded and then load the plyr library, there's a message that:
You have loaded plyr after dplyr - this is likely to cause problems. If you need functions from both plyr and dplyr, please load plyr first, then dplyr: library(plyr); library(dplyr)
Do not ignore this warning! My guess is that this happened, you ignored it, and that's part of the source of your troubles. You probably don't need plyr at all, but if you do, load it before dplyr!
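If both packages really do need to be loaded, one way to sidestep the masking is to call the dplyr verbs with an explicit dplyr:: prefix, which resolves to dplyr's versions regardless of load order. A minimal sketch on made-up data (the data frame and its column names are hypothetical):

```r
# Hypothetical toy data
df <- data.frame(id = c("A", "A", "B", "B"), value = c(1, 3, 10, 20))

# Explicit namespaces: these calls always use dplyr's verbs,
# even if plyr was attached afterwards and masks summarise()
res <- dplyr::summarise(dplyr::group_by(df, id), mean = mean(value))
res
```

The result has one row per id rather than a single overall mean, which is the behaviour the grouped summary is supposed to have.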