Using lapply and which to subset a dataframe by both characteristic and function


Question

I have a dataframe with 5 dimensions of data that looks like this:

> dim(alldata)
[1] 162   6
> head(alldata)
         value layer Kmultiplier Resolution      Season           Variable
1:  0.01308008     b        .01K        1km    Baseflow Evapotranspiration
2:  0.03974779     b        .01K        1km   Peak Flow Evapotranspiration
3:  0.02396524     b        .01K        1km Summer Flow Evapotranspiration
4: -0.15670996     b        .01K        1km    Baseflow          Discharge
5:  0.06774948     b        .01K        1km   Peak Flow          Discharge
6: -0.04138313     b        .01K        1km Summer Flow          Discharge

What I'd like to do is get the mean of the value column for certain 'characteristics' of the data based on the other columns. So I use which to subset the data to only the variables I'm interested in, for example:

> subset=alldata[which(alldata$Variable=="Discharge" & alldata$Resolution=="1km" & alldata$Season=="Peak Flow"),]
> subset
          value layer Kmultiplier Resolution    Season  Variable
1:  0.067749478     b        .01K        1km Peak Flow Discharge
2:  0.058260448     b         .1K        1km Peak Flow Discharge
3: -0.223953725     b         10K        1km Peak Flow Discharge
4:  0.272916114     g        .01K        1km Peak Flow Discharge
5:  0.240135025     g         .1K        1km Peak Flow Discharge
6: -0.216730348     g         10K        1km Peak Flow Discharge
7:  0.088966500     s        .01K        1km Peak Flow Discharge
8: -0.018943754     s         .1K        1km Peak Flow Discharge
9: -0.008339365     s         10K        1km Peak Flow Discharge

Here's where I'm stuck. Let's say I want a vector or list of the mean value for each value in the "layer" column, so I would end up with 3 numbers: one for 'b', one for 'g', and one for 's'. I need to make a bunch of subsets like this and I think the apply functions can help, but after multiple tutorials and Stack Overflow questions I cannot get this to work. A simpler example is fine too, like this:

> A=data.frame(seq(1,9),rep(c("a","b","c"),3),c(rep("type1",3),rep("type2",3),rep("type3",3)),c(rep("place1",2),rep("place2",2),rep("place3",2),rep("place1",2),rep("place2",1)))
> names(A)=c("value","Letter","Type","Place")
> A
  value Letter  Type  Place
1     1      a type1 place1
2     2      b type1 place1
3     3      c type1 place2
4     4      a type2 place2
5     5      b type2 place3
6     6      c type2 place3
7     7      a type3 place1
8     8      b type3 place1
9     9      c type3 place2

From this simple example, I need the mean of column "value", listed by Letter, for "place1", which should return something like "a=mean value, b=mean value, c=mean value" in whatever format works.

Is this a job for the apply functions? If so, how? If not, let me know a better alternative for subsetting my data.

Thank you!

Solution

Thanks for the advice. I ended up going with ddply (from the plyr package) in order to get my data into a more usable format, following general advice from this post.

Here's the simple example:

> A=data.frame(seq(1,9),rep(c("a","b","c"),3),c(rep("type1",3),rep("type2",3),rep("type3",3)),c(rep("place1",2),rep("place2",2),rep("place3",2),rep("place1",2),rep("place2",1)))
> names(A)=c("value","Letter","Type","Place")
> A
  value Letter  Type  Place
1     1      a type1 place1
2     2      b type1 place1
3     3      c type1 place2
4     4      a type2 place2
5     5      b type2 place3
6     6      c type2 place3
7     7      a type3 place1
8     8      b type3 place1
9     9      c type3 place2

Then here is my code to find the mean of 'value', grouped by "value", for the rows that are both place1 and type1:

> library(plyr)
> sub=ddply(A[which(A$Place=="place1" & A$Type=="type1"),],"value",summarize,mean=mean(value,na.rm=TRUE))
> sub
  value mean
1     1    1
2     2    2

Since 'sub' is already a dataframe, it's easy to add columns with other characteristics and then plot these results.
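
Note that grouping by "value" makes every row its own group, so each mean simply equals the value itself. For the output the question originally asked for (one mean per Letter within place1), the following is a minimal sketch in base R. It is an alternative to the ddply call above, not the code I actually ran, and the names place1, means_vec and means_df are introduced here for illustration only:

```r
# Rebuild the simple example data frame from the question.
A <- data.frame(
  value  = 1:9,
  Letter = rep(c("a", "b", "c"), 3),
  Type   = rep(c("type1", "type2", "type3"), each = 3),
  Place  = c(rep("place1", 2), rep("place2", 2), rep("place3", 2),
             rep("place1", 2), "place2"),
  stringsAsFactors = FALSE
)

# Keep only the place1 rows (illustrative name, not from the original post).
place1 <- A[A$Place == "place1", ]

# Named vector: one mean of 'value' per Letter that occurs in place1.
means_vec <- tapply(place1$value, place1$Letter, mean, na.rm = TRUE)

# The same summary as a data frame, which is handier for adding columns
# and plotting.
means_df <- aggregate(value ~ Letter, data = place1, FUN = mean)

print(means_vec)
print(means_df)
```

With this data, place1 contains only the letters a (values 1 and 7) and b (values 2 and 8), so the means are 4 for a and 5 for b; the letter c never occurs at place1 and is therefore absent from the result.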

---------------------------------------------------------------------------------

If you are interested, here is the more complex dataset I was actually trying to subset:

> head(alldata)
        value layer Kmultiplier Resolution      Season           Variable
1: 0.00000000     b           1        1km    Baseflow Evapotranspiration
2: 0.01308008     b         .01        1km    Baseflow Evapotranspiration
3: 0.00000000     b           1        1km   Peak Flow Evapotranspiration
4: 0.03974779     b         .01        1km   Peak Flow Evapotranspiration
5: 0.00000000     b           1        1km Summer Flow Evapotranspiration
6: 0.02396524     b         .01        1km Summer Flow Evapotranspiration

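And the lines of code I wrote to subset it into plottable pieces (they need the plyr, ggplot2 and scales packages loaded; Season, res and path are not defined in this snippet):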

  for(j in Season){
    for(i in res){
      ET=ddply(alldata[which(alldata$Variable=="Evapotranspiration" & alldata$Resolution==sprintf("%s",i) & alldata$Season==sprintf("%s",j)),],"Kmultiplier", summarize, mean = mean(value,na.rm=T))
      ET$Variable="Evapotranspiration";ET$Resolution=sprintf("%s",i);ET$Season=sprintf("%s",j)
      S=ddply(alldata[which(alldata$Variable=="Change in Storage" & alldata$Resolution==sprintf("%s",i) & alldata$Season==sprintf("%s",j)),],"Kmultiplier", summarize, mean = mean(value,na.rm=T))
      S$Variable="Change in Storage";S$Resolution=sprintf("%s",i);S$Season=sprintf("%s",j)
      Q=ddply(alldata[which(alldata$Variable=="Discharge" & alldata$Resolution==sprintf("%s",i) & alldata$Season==sprintf("%s",j)),],"Kmultiplier", summarize, mean = mean(value,na.rm=T))
      Q$Variable="Discharge";Q$Resolution=sprintf("%s",i);Q$Season=sprintf("%s",j)
      if(i=="1km"){resbind=rbind(Q,S,ET)}else{resbind2=rbind(resbind,Q,S,ET)}
    } 
    if(j=="Baseflow"){sbind=rbind(resbind2,Q,S,ET)}else if(j=="Peak Flow"){sbind2=rbind(resbind2,sbind,Q,S,ET)}else{ETSQ=rbind(resbind2,sbind2,Q,S,ET)}
  }
  ETSQ$Variable=factor(ETSQ$Variable,levels=c("Change in Storage","Evapotranspiration","Discharge"))
  print(ggplot(data=ETSQ,aes(x=Kmultiplier,y=mean, color=Variable,group=Variable))
        +geom_point()
        +geom_line()
        +labs(x="K scaled by",y="Percent change from Baseline case")
        +scale_y_continuous(labels=percent)
        +facet_grid(Season~Resolution)
        +theme_bw()
  )
  ggsave(sprintf("%s/Plots/SimpleLines/Variable_by_K.png",path),device = NULL,scale=1)
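
The nested loops above compute one mean per Kmultiplier for every combination of Variable, Resolution and Season, then stitch the pieces together with rbind. The same kind of table can probably be produced in one grouped call. The sketch below uses base R's aggregate() on a small made-up stand-in for alldata: the values, and the "500m" resolution, are hypothetical and only there to make the example runnable.

```r
# Hypothetical stand-in for 'alldata' with the same column names as above.
# The values and the "500m" resolution are invented for illustration.
alldata <- expand.grid(
  Kmultiplier = c(".01", "1"),
  layer       = c("b", "g", "s"),
  Resolution  = c("1km", "500m"),
  Season      = c("Baseflow", "Peak Flow"),
  Variable    = c("Discharge", "Evapotranspiration"),
  stringsAsFactors = FALSE
)
alldata$value <- seq_len(nrow(alldata)) / 100

# One mean per Kmultiplier x Variable x Resolution x Season combination,
# averaging over the remaining column (layer). The formula interface of
# aggregate() drops rows with a missing 'value' by default.
ETSQ <- aggregate(
  value ~ Kmultiplier + Variable + Resolution + Season,
  data = alldata,
  FUN  = mean
)

# Rename the summary column so it matches the 'mean' column used for plotting.
names(ETSQ)[names(ETSQ) == "value"] <- "mean"

print(head(ETSQ))
```

Because every grouping column is kept in the result, there is no need to re-attach Variable, Resolution and Season by hand or to track intermediate rbind results, and the result should be usable with the same ggplot call.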

And finally the resulting plot:
