Using lapply and which to subset a dataframe by both characteristic and function
Question
I have a dataframe with 5 dimensions of data that looks like this:
> dim(alldata)
[1] 162 6
> head(alldata)
value layer Kmultiplier Resolution Season Variable
1: 0.01308008 b .01K 1km Baseflow Evapotranspiration
2: 0.03974779 b .01K 1km Peak Flow Evapotranspiration
3: 0.02396524 b .01K 1km Summer Flow Evapotranspiration
4: -0.15670996 b .01K 1km Baseflow Discharge
5: 0.06774948 b .01K 1km Peak Flow Discharge
6: -0.04138313 b .01K 1km Summer Flow Discharge
What I'd like to do is get the mean of the value column for certain 'characteristics' of the data, based on the other columns. So I use which to subset the data to only the variables I'm interested in, for example:
> subset=alldata[which(alldata$Variable=="Discharge" & alldata$Resolution=="1km" & alldata$Season=="Peak Flow"),]
> subset
value layer Kmultiplier Resolution Season Variable
1: 0.067749478 b .01K 1km Peak Flow Discharge
2: 0.058260448 b .1K 1km Peak Flow Discharge
3: -0.223953725 b 10K 1km Peak Flow Discharge
4: 0.272916114 g .01K 1km Peak Flow Discharge
5: 0.240135025 g .1K 1km Peak Flow Discharge
6: -0.216730348 g 10K 1km Peak Flow Discharge
7: 0.088966500 s .01K 1km Peak Flow Discharge
8: -0.018943754 s .1K 1km Peak Flow Discharge
9: -0.008339365 s 10K 1km Peak Flow Discharge
Here's where I'm stuck. Let's say I want a vector or list of the mean value for each value in the "layer" column, so I would end up with 3 numbers: one for 'b', one for 'g' and one for 's'. I need to make a bunch of subsets like this and I think the apply functions can help, but after multiple tutorials and Stack Overflow questions I cannot get this to work. A simpler example is fine too, like this:
> A=data.frame(seq(1,9),rep(c("a","b","c"),3),c(rep("type1",3),rep("type2",3),rep("type3",3)),c(rep("place1",2),rep("place2",2),rep("place3",2),rep("place1",2),rep("place2",1)))
> names(A)=c("value","Letter","Type","Place")
> A
value Letter Type Place
1 1 a type1 place1
2 2 b type1 place1
3 3 c type1 place2
4 4 a type2 place2
5 5 b type2 place3
6 6 c type2 place3
7 7 a type3 place1
8 8 b type3 place1
9 9 c type3 place2
From this simple example, I need the mean of column "value", listed by Letter, for "place1", which should return something like "a=mean value, b=mean value, c=mean value" in whatever format works.
Is this a job for the apply functions? If so, how? If not, let me know a better alternative for subsetting my data.
Thanks!
Recommended answer
Thanks for the advice. I ended up going with ddply (from the plyr package) in order to get my data into a more usable format, following general advice from this post.
Here is the simple example:
> A=data.frame(seq(1,9),rep(c("a","b","c"),3),c(rep("type1",3),rep("type2",3),rep("type3",3)),c(rep("place1",2),rep("place2",2),rep("place3",2),rep("place1",2),rep("place2",1)))
> names(A)=c("value","Letter","Type","Place")
> A
value Letter Type Place
1 1 a type1 place1
2 2 b type1 place1
3 3 c type1 place2
4 4 a type2 place2
5 5 b type2 place3
6 6 c type2 place3
7 7 a type3 place1
8 8 b type3 place1
9 9 c type3 place2
Then here is my code to find the mean of 'value' for every value that is both place1 and type1:
> sub=ddply(A[which(A$Place=="place1" & A$Type=="type1"),],"value",summarize,mean=mean(value,na.rm=T))
> sub
value mean
1 1 1
2 2 2
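Note that grouping by "value" puts each row in its own group, so the mean column simply repeats value. If the goal is the one from the question (the mean of value per Letter for place1), a minimal base R sketch could look like the following. It assumes the toy data frame A shown above; the names place1_rows, means_by_letter and means_df are only illustrative.

```r
# Rebuild the toy data frame from the example above
A <- data.frame(
  value  = 1:9,
  Letter = rep(c("a", "b", "c"), 3),
  Type   = rep(c("type1", "type2", "type3"), each = 3),
  Place  = c(rep("place1", 2), rep("place2", 2), rep("place3", 2),
             rep("place1", 2), "place2"),
  stringsAsFactors = FALSE
)

# Keep only the rows for place1 (plain logical indexing is enough here,
# because the Place column contains no NA values)
place1_rows <- A[A$Place == "place1", ]

# Mean of value for each Letter that occurs in place1, as a named vector
means_by_letter <- tapply(place1_rows$value, place1_rows$Letter, mean)
print(means_by_letter)

# The same summary as a data frame, which is convenient for plotting
means_df <- aggregate(value ~ Letter, data = place1_rows, FUN = mean)
print(means_df)
```

In this toy data the letter "c" never occurs in place1, so only "a" and "b" get a mean.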
Since 'sub' is already a dataframe, it's easy to add columns with other characteristics and then plot these results.
If you are interested, here is the more complex dataset I was actually trying to subset:
> head(alldata)
value layer Kmultiplier Resolution Season Variable
1: 0.00000000 b 1 1km Baseflow Evapotranspiration
2: 0.01308008 b .01 1km Baseflow Evapotranspiration
3: 0.00000000 b 1 1km Peak Flow Evapotranspiration
4: 0.03974779 b .01 1km Peak Flow Evapotranspiration
5: 0.00000000 b 1 1km Summer Flow Evapotranspiration
6: 0.02396524 b .01 1km Summer Flow Evapotranspiration
And the lines of code I wrote to subset it into plottable pieces:
for(j in Season){
for(i in res){
ET=ddply(alldata[which(alldata$Variable=="Evapotranspiration" & alldata$Resolution==sprintf("%s",i) & alldata$Season==sprintf("%s",j)),],"Kmultiplier", summarize, mean = mean(value,na.rm=T))
ET$Variable="Evapotranspiration";ET$Resolution=sprintf("%s",i);ET$Season=sprintf("%s",j)
S=ddply(alldata[which(alldata$Variable=="Change in Storage" & alldata$Resolution==sprintf("%s",i) & alldata$Season==sprintf("%s",j)),],"Kmultiplier", summarize, mean = mean(value,na.rm=T))
S$Variable="Change in Storage";S$Resolution=sprintf("%s",i);S$Season=sprintf("%s",j)
Q=ddply(alldata[which(alldata$Variable=="Discharge" & alldata$Resolution==sprintf("%s",i) & alldata$Season==sprintf("%s",j)),],"Kmultiplier", summarize, mean = mean(value,na.rm=T))
Q$Variable="Discharge";Q$Resolution=sprintf("%s",i);Q$Season=sprintf("%s",j)
if(i=="1km"){resbind=rbind(Q,S,ET)}else{resbind2=rbind(resbind,Q,S,ET)}
}
if(j=="Baseflow"){sbind=rbind(resbind2,Q,S,ET)}else if(j=="Peak Flow"){sbind2=rbind(resbind2,sbind,Q,S,ET)}else{ETSQ=rbind(resbind2,sbind2,Q,S,ET)}
}
ETSQ$Variable=factor(ETSQ$Variable,levels=c("Change in Storage","Evapotranspiration","Discharge"))
print(ggplot(data=ETSQ,aes(x=Kmultiplier,y=mean, color=Variable,group=Variable))
+geom_point()
+geom_line()
+labs(x="K scaled by",y="Percent change from Baseline case")
+scale_y_continuous(labels=percent)
+facet_grid(Season~Resolution)
+theme_bw()
)
ggsave(sprintf("%s/Plots/SimpleLines/Variable_by_K.png",path),device = NULL,scale=1)
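The nested loops above build one summary per Variable, Resolution and Season and then bind the pieces together, which depends on Season and res being defined and on the order in which they are iterated. A possibly simpler route is a single grouped summary over all of those columns at once. The sketch below uses base R's aggregate on a hypothetical stand-in called alldata_demo, with the same column names as alldata but made-up values and a made-up second resolution ("500m"); ddply also accepts a character vector of several grouping columns, so the same idea should carry over to plyr.

```r
# Hypothetical stand-in for alldata: same column names, made-up values
alldata_demo <- expand.grid(
  layer       = c("b", "g"),
  Kmultiplier = c(".01K", "10K"),
  Resolution  = c("1km", "500m"),      # "500m" is an invented level
  Season      = c("Baseflow", "Peak Flow"),
  Variable    = c("Discharge", "Evapotranspiration"),
  stringsAsFactors = FALSE
)
alldata_demo$value <- seq_len(nrow(alldata_demo)) / 100

# One grouped summary: the mean of value across layers for every
# combination of Kmultiplier, Resolution, Season and Variable
ETSQ_demo <- aggregate(
  value ~ Kmultiplier + Resolution + Season + Variable,
  data = alldata_demo,
  FUN  = mean
)

# Rename the summary column so it matches the y = mean aesthetic used above
names(ETSQ_demo)[names(ETSQ_demo) == "value"] <- "mean"
print(head(ETSQ_demo))
```

The result already carries the Variable, Resolution and Season columns, so no rbind bookkeeping is needed before passing it to ggplot with facet_grid(Season~Resolution).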
And finally, the resulting plot: