R: apply simple function to specific columns by grouped variable


Problem description

I have a data set with two observations for each person and more than 100 variables. For each person, I would like to fill in the missing values of a variable with the available value of that same variable from the person's other row. I can do this manually with the dplyr mutate function, but that would be cumbersome for all the variables that need to be filled in.

Here is what I tried, but it failed:

> # Here's data example
> # https://www.dropbox.com/s/a0bc69xgxhaeguc/data_xlsc.xlsx?dl=0
> # I have already attached it to my working space
> 
> names(data)
 [1] "ID"   "Age"  "var1" "var2" "var3" "var4" "var5" "var6" "var7" "var8" "var9"
> head(data)
Source: local data frame [6 x 11]

  ID Age var1 var2  var3 var4 var5 var6  var7 var8 var9
1  1  50 27.5 1.83  92.0   NA   NA   NA    NA   NA  5.1
2  1  NA   NA   NA    NA 3.54 30.2 27.9 64.34 60.8   NA
3  2  51 33.7 1.77 105.6   NA   NA   NA    NA   NA  5.2
4  2  NA   NA   NA    NA 4.05 36.4 38.7 67.75 63.7   NA
5  3  43 26.3 1.84  89.1   NA   NA   NA    NA   NA  4.8
6  3  NA   NA   NA    NA 3.77 24.4 21.9 67.97 64.2   NA

> # As you can see above, for each person (ID) there are missing values for age and other variables.
> # I'd like to fill in missing data with the available data for each variable, for each ID
> 
> #These are the variables that I need to fill in
> desired_variables <- names(data[,2:11])
> 
> # this is my attempt that failed
> 
> data2 <- data %>% group_by(ID) %>% 
+      do(
+      for (i in seq_along(desired_variables)) {
+           i=max(i, na.rm=T)
+      }
+ )
Error: Results are not data frames at positions: 1, 2, 3

Desired output for the first person:

  ID Age var1 var2  var3 var4 var5 var6  var7 var8 var9
1  1  50 27.5 1.83  92.0 3.54 30.2 27.9 64.34 60.8  5.1
2  1  50 27.5 1.83  92.0 3.54 30.2 27.9 64.34 60.8  5.1

Solution

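Here's a possible data.table solution: for each ID, take the maximum of every column while ignoring NA values, and assign the result back by reference.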

library(data.table)  
setattr(data, "class", "data.frame") ## If your data is of `tbl_df` class
setDT(data)[, (desired_variables) := lapply(.SD, max, na.rm = TRUE), by = ID] ## you can also use `.SDcols` if you want to specify specific columns
data
#    ID Age var1 var2  var3 var4 var5 var6  var7 var8 var9
# 1:  1  50 27.5 1.83  92.0 3.54 30.2 27.9 64.34 60.8  5.1
# 2:  1  50 27.5 1.83  92.0 3.54 30.2 27.9 64.34 60.8  5.1
# 3:  2  51 33.7 1.77 105.6 4.05 36.4 38.7 67.75 63.7  5.2
# 4:  2  51 33.7 1.77 105.6 4.05 36.4 38.7 67.75 63.7  5.2
# 5:  3  43 26.3 1.84  89.1 3.77 24.4 21.9 67.97 64.2  4.8
# 6:  3  43 26.3 1.84  89.1 3.77 24.4 21.9 67.97 64.2  4.8
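
The `.SDcols` variant mentioned in the comment above could look roughly like the sketch below. It is only an illustration under a few assumptions: the toy table `dt` copies a handful of values from the question, `fill_cols` is an arbitrary subset of columns, and `first_non_na` is a hypothetical helper (not part of data.table) used in place of `max`. The helper is one way to avoid the `-Inf` plus warning that `max(x, na.rm = TRUE)` produces when a group has no non-missing value at all.

```r
library(data.table)

# Toy data mirroring the first two persons in the question (values copied from the post).
dt <- data.table(
  ID   = c(1, 1, 2, 2),
  Age  = c(50, NA, 51, NA),
  var1 = c(27.5, NA, 33.7, NA),
  var4 = c(NA, 3.54, NA, 4.05)
)

# Hypothetical helper: first non-missing value in x.
# If every value in the group is missing, x[1] is returned, which is then NA
# of the column's own type (max(x, na.rm = TRUE) would give -Inf with a warning).
first_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) > 0) keep[1] else x[1]
}

# Assumed subset of columns to fill; the question uses every column except ID.
fill_cols <- c("Age", "var1", "var4")

# .SDcols restricts .SD to the listed columns; := updates dt by reference,
# recycling the single value returned per column across the rows of each ID.
dt[, (fill_cols) := lapply(.SD, first_non_na), by = ID, .SDcols = fill_cols]
print(dt)
```

Because each person has at most one non-missing value per variable in this data, taking the first non-missing value and taking the maximum should give the same filled result; the two differ only when a group contains several distinct non-missing values or none at all.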
