Apply a function over groups of columns

Problem description

How can I use apply or a related function to create a new data frame that contains the row averages of each group of consecutive columns in a very large data frame?

I have an instrument that outputs n replicate measurements on a large number of samples, where each single measurement is a vector (all measurements are the same length vectors). I'd like to calculate the average (and other stats) on all replicate measurements of each sample. This means I need to group n consecutive columns together and do row-wise calculations.

For a simple example, with three replicate measurements on two samples, how can I end up with a data frame that has two columns (one per sample): one that is the row-wise average of the replicates in dat$a, dat$b and dat$c, and one that is the row-wise average of dat$d, dat$e and dat$f?

Here's some example data

dat <- data.frame( a = rnorm(16), b = rnorm(16), c = rnorm(16), d = rnorm(16), e = rnorm(16), f = rnorm(16))

            a          b            c          d           e          f
1  -0.9089594 -0.8144765  0.872691548  0.4051094 -0.09705234 -1.5100709
2   0.7993102  0.3243804  0.394560355  0.6646588  0.91033497  2.2504104
3   0.2963102 -0.2911078 -0.243723116  1.0661698 -0.89747522 -0.8455833
4  -0.4311512 -0.5997466 -0.545381175  0.3495578  0.38359390  0.4999425
5  -0.4955802  1.8949285 -0.266580411  1.2773987 -0.79373386 -1.8664651
6   1.0957793 -0.3326867 -1.116623982 -0.8584253  0.83704172  1.8368212
7  -0.2529444  0.5792413 -0.001950741  0.2661068  1.17515099  0.4875377
8   1.2560402  0.1354533  1.440160168 -2.1295397  2.05025701  1.0377283
9   0.8123061  0.4453768  1.598246016  0.7146553 -1.09476532  0.0600665
10  0.1084029 -0.4934862 -0.584671816 -0.8096653  1.54466019 -1.8117459
11 -0.8152812  0.9494620  0.100909570  1.5944528  1.56724269  0.6839954
12  0.3130357  2.6245864  1.750448404 -0.7494403  1.06055267  1.0358267
13  1.1976817 -1.2110708  0.719397607 -0.2690107  0.83364274 -0.6895936
14 -2.1860098 -0.8488031 -0.302743475 -0.7348443  0.34302096 -0.8024803
15  0.2361756  0.6773727  1.279737692  0.8742478 -0.03064782 -0.4874172
16 -1.5634527 -0.8276335  0.753090683  2.0394865  0.79006103  0.5704210

I'm after something like this

            X1          X2
1  -0.28358147 -0.40067128
2   0.50608365  1.27513471
3  -0.07950691 -0.22562957
4  -0.52542633  0.41103139
5   0.37758930 -0.46093340
6  -0.11784382  0.60514586
7   0.10811540  0.64293184
8   0.94388455  0.31948189
9   0.95197629 -0.10668118
10 -0.32325169 -0.35891702
11  0.07836345  1.28189698
12  1.56269017  0.44897971
13  0.23533617 -0.04165384
14 -1.11251880 -0.39810121
15  0.73109533  0.11872758
16 -0.54599850  1.13332286

I got that with the code below, but it is obviously no good for my much larger data frame:

data.frame(cbind(
  apply(cbind(dat$a, dat$b, dat$c), 1, mean),
  apply(cbind(dat$d, dat$e, dat$f), 1, mean)
))

I've tried apply and loops and can't quite get it together. My actual data has some hundreds of columns.

Solution

This may be more generalizable to your situation in that you pass a list of indices. If speed is an issue (large data frame) I'd opt for lapply with do.call rather than sapply:

x <- list(1:3, 4:6)
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))

This also works if you only have the column names:

x <- list(c('a','b','c'), c('d', 'e', 'f'))
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
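
The question also asks for other statistics and for a data frame as the result. The following is a minimal sketch of one way to generalize the approach above, not part of the original answer. The helper name `group_stat` and the output column names `X1`, `X2` are assumptions made for illustration. It uses `apply` so that any row-wise summary can be passed in, which is likely slower than `rowMeans` for plain means.

```r
set.seed(1)  # only to make the example reproducible
dat <- data.frame(a = rnorm(16), b = rnorm(16), c = rnorm(16),
                  d = rnorm(16), e = rnorm(16), f = rnorm(16))

# Hypothetical helper: apply a row-wise summary `f` to each group of columns
# and return a data frame with one column per group.
group_stat <- function(df, groups, f = mean) {
  res <- lapply(groups, function(i) {
    apply(as.matrix(df[, i, drop = FALSE]), 1, f)
  })
  out <- as.data.frame(do.call(cbind, res))
  names(out) <- paste0("X", seq_along(groups))
  out
}

x <- list(1:3, 4:6)
means <- group_stat(dat, x, mean)  # same values as the rowMeans version
sds   <- group_stat(dat, x, sd)    # row-wise standard deviation per sample
```

`drop = FALSE` keeps a group of a single column as a one-column data frame, so the helper does not fail if a group has only one member.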

EDIT

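It just occurred to me that you may want to automate this for every three columns. I know there's a better way, but here it is on a 100-column data set. Note that any columns left over after the last complete group of three (column 100 here) are dropped by na.omit: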

dat <- data.frame(matrix(rnorm(16*100), ncol=100))

n <- 1:ncol(dat)
ind <- matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=TRUE, ncol=3)
ind <- data.frame(t(na.omit(ind)))
do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))
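
One possible alternative for building the index list, sketched here under the assumption that replicates always sit in consecutive columns, is to label each column with its group number and `split` the column positions. The variable names `k` and `grp` are illustrative and not from the original answer.

```r
dat <- data.frame(matrix(rnorm(16 * 100), ncol = 100))

k <- 3  # replicates per sample (assumed constant)

# Group label for each column position: 1, 1, 1, 2, 2, 2, ...
grp <- ceiling(seq_along(dat) / k)
ind <- split(seq_along(dat), grp)

# Keep only complete groups; a trailing group with fewer than k columns is dropped
ind <- ind[lengths(ind) == k]

res <- do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))
dim(res)
```

This avoids padding with NA and then filtering the padding out again. The result is a matrix with one column per complete group.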

EDIT 2

Still not happy with the indexing. I think there's a better/faster way to pass the indexes. Here's a second, though still unsatisfying, method:

n <- 1:ncol(dat)
ind <- data.frame(matrix(c(n, rep(NA, 3 - ncol(dat) %% 3)), byrow = FALSE, nrow = 3))
nonna <- sapply(ind, function(x) all(!is.na(x)))
ind <- ind[, nonna]

do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))
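
If the groups are always the same size and in consecutive columns, the index list can be skipped altogether. The sketch below is an assumption-laden alternative, not the answerer's method. It reshapes the data into a three-dimensional array and averages over the replicate dimension in one call to `colMeans`. It only works when every column is numeric.

```r
dat <- data.frame(matrix(rnorm(16 * 100), ncol = 100))

k <- 3                                        # replicates per sample (assumed constant)
n_groups <- ncol(dat) %/% k                   # complete groups only
m <- as.matrix(dat[, seq_len(n_groups * k)])  # drop leftover columns

# Reshape to rows x replicates x samples:
# a[i, j, g] is row i of replicate j of sample g
a <- array(m, dim = c(nrow(m), k, n_groups))

# Move the replicate dimension to the front and average over it
res <- colMeans(aperm(a, c(2, 1, 3)), dims = 1)
dim(res)
```

Whether this is actually faster than the `lapply` version depends on the data size, so it is worth timing both on the real data set.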
