Aggregate by group and get count, mean and sd of non-NA values for different data.frame columns


Question

I am having some difficulty counting non-missing values by group through the function below (which also gives sd, and mean):

test <- do.call(data.frame, aggregate(. ~ treatment, have, function(x) c(n = sum(!is.na(x)), mean = mean(x), sd = sd(x))))

It ends up giving me the number of non-missing values across all columns in the data frame instead of for each column separately.

I have been looking through SO for advice and found several related answers helpful, but I can't figure out why aggregate with function(x) would combine columns for sum(!is.na(x)) but not for the mean or sd.

EDIT: Adding tables

This is the data I have

This is the data I get from my code

This is the table I want

You will notice in the 'have' data frame that counting the non-missing rows in column var1 by treatment group gives the following:

veh - 9 gr.4 - 8 gr.3 - 10 gr.2 - 5

But when using sum(!is.na(x)) I get the following:

veh - 6 gr.4 - 5 gr.3 - 10 gr.2 - 5

I believe this is because the function is using both var1 and var2 when counting the non-missing values. I do not know how to correct for this.

Best,

Jack

Solution

Here's a data.table approach:

DATA

The data you have is cumbersome to read into R - please use dput() etc. to make it easier for others:

> dput(dt)
structure(list(someting = c("503", "553", "599", "647", "695", 
"728", "760", "793", "826", "859", "907", "955", "1003", "1036", 
"1084", "1131", "1179", "1226", "1274", "1322", "1355", "1402", 
"1450", "1497", "1545"), treatment = c("gr.2", "gr.2", "gr.2", 
"gr.2", "gr.2", "gr.2", "gr.2", "gr.2", "gr.2", "gr.2", "gr.2", 
"gr.3", "gr.3", "gr.3", "gr.3", "gr.3", "gr.3", "gr.3", "gr.3", 
"gr.3", "gr.3", "gr.3", "gr.3", "gr.4", "gr.4"), var1 = c(8, 
NA, 3, 3, NA, NA, NA, NA, NA, 8, 8, 8, NA, 8, 8, 8, 8, 8, 8, 
NA, 8, 8, 8, 8, NA), var2 = c(8L, 8L, 8L, 8L, NA, NA, NA, NA, 
NA, 8L, 8L, 8L, NA, 8L, 8L, 8L, 8L, 8L, 8L, NA, 8L, 8L, 8L, 8L, 
NA)), .Names = c("someting", "treatment", "var1", "var2"), row.names = c(NA, 
-25L), class = c("data.table", "data.frame"))

CODE

dt[, .(var1.n = sum(!is.na(var1)),
       var2.n = sum(!is.na(var2)),  # count var2 here, not var1
       var1.mean = mean(var1, na.rm = TRUE),
       var2.mean = mean(var2, na.rm = TRUE)),
   by = .(treatment)]

OUTPUT

      treatment var1.n var2.n var1.mean var2.mean
1:      gr.2      5      6         6         8
2:      gr.3     10     10         8         8
3:      gr.4      1      1         8         8

For some reason the "veh" entries weren't read in. Hence the output is slightly different but the principle ought to be clear.
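
As for why the original aggregate() call miscounts: the formula method of aggregate() defaults to na.action = na.omit, so with `. ~ treatment` any row that has an NA in any column is dropped before the groups are formed. That would explain per-column counts that look as if var1 and var2 were combined. Below is a minimal base R sketch of one way around it, passing na.action = na.pass and letting the summary function handle NAs itself. The data frame and the helper name summarise_col are made up for illustration and are not the asker's data.

```r
# Small made-up data set (hypothetical values, not the asker's data)
have <- data.frame(
  treatment = c("a", "a", "a", "b", "b", "b"),
  var1 = c(1, NA, 3, 4, 5, NA),
  var2 = c(NA, 2, 3, 4, NA, NA)
)

# Hypothetical helper: count, mean and sd of the non-NA values of one column
summarise_col <- function(x) {
  c(n = sum(!is.na(x)),
    mean = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE))
}

# Default behaviour: na.action = na.omit removes every row that has an NA
# in any column, so only complete rows are counted
default_res <- do.call(
  data.frame,
  aggregate(. ~ treatment, have, summarise_col)
)

# na.pass keeps all rows; NAs are then handled column by column
# inside summarise_col
passed_res <- do.call(
  data.frame,
  aggregate(. ~ treatment, have, summarise_col, na.action = na.pass)
)

print(default_res[, c("treatment", "var1.n", "var2.n")])
print(passed_res[, c("treatment", "var1.n", "var2.n")])
```

With na.pass the counts are computed per column, which is the same result the data.table code above produces by naming each column explicitly.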
