R applying a function to a subset of a data frame

Question

I looked online extensively and did not see an answer to this particular question (I think).

The best way for me to explain myself will be with some code that replicates my problem. I made some temp data:

x <- runif(100,1,2)
y <- runif(100,2,3)

z <- c(rep(1,100))
temp <- cbind(x,y,z)

temp[1:25,3] = temp[1:25,3] +2

temp <- as.data.frame(temp)

This is what temp looks like:

         x        y   z
1   1.512620 2.552271 3
2   1.133614 2.455296 3
3   1.543242 2.490120 3
4   1.047618 2.069474 3
.      .        .     .
.      .        .     .
27  1.859012 2.687665 1
28  1.231450 2.196395 1

and it continues on until the end of the data frame (100 rows).

What I want to do is apply a function to the data frame BUT to subsets of the data. So, for example, I want to apply the function mean to the columns x and y for when z=3 and apply the function mean to the columns x and y for when z=1. So I would end up with 4 values: the mean of x when z=1 and when z=3 and the mean of y when z=1 and z=3. For my actual dataset the number of rows for when z= some value varies a lot.

I have been using the following code which does work; however, it makes me feel very uneasy since I feel like the code could be more efficient AND ideally avoid a for loop.

x <- c(unique(temp$z))

I use that ^^ to get the unique z values (in this case z=3 and z=1).

for(i in x){
  assign(paste("newdata",i,sep=""),subset(temp[which(temp$z==i),],select=c("x","y")))
} 

So I now have two new data frames newdata1 and newdata3 that don't have the same number of rows. newdata1 has all the values when z=1 and newdata3 has all the values when z=3.

library(gdata)

blah <-cbindX(newdata1,newdata3)

I use cbindX to combine the subsetted data into one large data frame again. I am not sure why I do this exactly (I made this code a long time ago). All I remember is this is the only way I could get it to work when I use the for loop above. The main problem with the code is when I have multiple z values then manually typing in that list becomes very cumbersome. If z ranged from 1 to 50 then a user would type in newdata1, newdata2, newdata3 .... etc.

But... it does work:

summ.test <- apply(blah,2,function(x) { 
c(min(x,na.rm=TRUE),median(x,na.rm=TRUE),max(x,na.rm=TRUE),sum(!is.na(x)))})

         x         y         x         y
[1,]  1.028332  2.018162  1.012379  2.009595
[2,]  1.509049  2.504000  1.427981  2.455296
[3,]  1.992704  2.998483  1.978359  2.970695
[4,] 75.000000 75.000000 25.000000 25.000000

So what I effectively did is create a new data frame with the values I subsetted before and apply the functions of interest to them. With the function shown above, the rows are the minimum, median, maximum, and count of non-missing values (the same approach works for the mean), and the columns are x when z=1, y when z=1, x when z=3, and y when z=3.
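As an aside, the summary table above can be reproduced without creating newdata1, newdata3, and so on by hand. The following is a minimal sketch rather than the asker's method: it assumes base R only, rebuilds temp so that it runs on its own (set.seed is added purely for reproducibility), and uses a hypothetical helper named summ.

```r
# Assumption: temp is rebuilt here so the block is self-contained;
# the first 25 rows have z = 3 and the remaining 75 have z = 1.
set.seed(1)
temp <- data.frame(x = runif(100, 1, 2),
                   y = runif(100, 2, 3),
                   z = rep(c(3, 1), times = c(25, 75)))

# Hypothetical helper: min, median, max and non-missing count of one column.
summ <- function(v) {
  c(min    = min(v, na.rm = TRUE),
    median = median(v, na.rm = TRUE),
    max    = max(v, na.rm = TRUE),
    n      = sum(!is.na(v)))
}

# split() returns a named list with one data frame per unique z value,
# so no assign()/paste() loop is needed, however many z values exist.
pieces <- split(temp[c("x", "y")], temp$z)

# Summarise every column of every piece; each element is a 4 x 2 matrix.
by_group <- lapply(pieces, function(d) sapply(d, summ))

# Bind the per-group matrices side by side, like the cbindX() layout.
summ.split <- do.call(cbind, by_group)
print(summ.split)
```

Because the pieces live in a list, the number of rows per group can differ freely and nothing has to be typed out per z value.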

Main problems that should be fairly obvious: the for loop method to subset the data frame causes more problems than I'd hoped. Any recommendations to avoid that entirely and still end up with the same result?

Please let me know if any of this is confusing or if my code is just plain sloppy! I am still working on formatting questions on this site, too.

Recommended answer

> aggregate( . ~ z, data=temp, FUN=mean)
  z        x        y
1 1 1.505304 2.474642
2 3 1.533418 2.477191

When you will be applying the same function to multiple columns within categories of another column, think about 'aggregate'. This is the version that takes a formula argument, where the "dot" before the tilde says to get the mean of all of the columns besides "z".
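The same formula interface can also return several statistics per group, which is closer to the min/median/max/count table in the question. This is a hedged sketch, not part of the original answer: temp is rebuilt with an assumed seed so the block runs on its own, and the names means, stats, and stats_flat are illustrative.

```r
# Assumption: temp is rebuilt here so the block is self-contained.
set.seed(1)
temp <- data.frame(x = runif(100, 1, 2),
                   y = runif(100, 2, 3),
                   z = rep(c(3, 1), times = c(25, 75)))

# One row per z value, one mean per remaining column.
means <- aggregate(. ~ z, data = temp, FUN = mean)

# FUN may return a vector; aggregate() then stores one matrix per column.
stats <- aggregate(. ~ z, data = temp,
                   FUN = function(v) c(min = min(v), median = median(v),
                                       max = max(v), n = length(v)))

# Flatten the matrix columns into ordinary columns: x.min, x.median, ...
stats_flat <- do.call(data.frame, stats)
print(means)
print(stats_flat)
```

One caveat: by default the formula method of aggregate drops any row that contains an NA in any column before summarising, so counts can differ from a column-by-column `sum(!is.na(x))` when the data have missing values.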
