对列组应用函数 [英] apply a function over groups of columns

查看:12
本文介绍了对列组应用函数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我如何使用 apply 或相关函数来创建一个新的数据框,其中包含一个非常大的数据框中每对列的行平均值的结果?

How can I use apply or a related function to create a new data frame that contains the results of the row averages of each pair of columns in a very large data frame?

我有一个仪器可以输出 n 对大量样本的重复测量值,其中每个测量值都是一个向量(所有测量值都是相同的长度向量).我想计算每个样本的所有重复测量的平均值(和其他统计数据).这意味着我需要将 n 连续列组合在一起并进行逐行计算.

I have an instrument that outputs n replicate measurements on a large number of samples, where each single measurement is a vector (all measurements are the same length vectors). I'd like to calculate the average (and other stats) on all replicate measurements of each sample. This means I need to group n consecutive columns together and do row-wise calculations.

举一个简单的例子,对两个样本进行三次重复测量,我如何最终得到一个有两列(每个样本一列)的数据框,其中一列是 dat 中每行重复的平均值$adat$bdat$c 以及 dat$d 每一行的平均值,dat$edat$f.

For a simple example, with three replicate measurements on two samples, how can I end up with a data frame that has two columns (one per sample), one that is the average each row of the replicates in dat$a, dat$b and dat$c and one that is the average of each row for dat$d, dat$e and dat$f.

这是一些示例数据

dat <- data.frame( a = rnorm(16), b = rnorm(16), c = rnorm(16), d = rnorm(16), e = rnorm(16), f = rnorm(16))

            a          b            c          d           e          f
1  -0.9089594 -0.8144765  0.872691548  0.4051094 -0.09705234 -1.5100709
2   0.7993102  0.3243804  0.394560355  0.6646588  0.91033497  2.2504104
3   0.2963102 -0.2911078 -0.243723116  1.0661698 -0.89747522 -0.8455833
4  -0.4311512 -0.5997466 -0.545381175  0.3495578  0.38359390  0.4999425
5  -0.4955802  1.8949285 -0.266580411  1.2773987 -0.79373386 -1.8664651
6   1.0957793 -0.3326867 -1.116623982 -0.8584253  0.83704172  1.8368212
7  -0.2529444  0.5792413 -0.001950741  0.2661068  1.17515099  0.4875377
8   1.2560402  0.1354533  1.440160168 -2.1295397  2.05025701  1.0377283
9   0.8123061  0.4453768  1.598246016  0.7146553 -1.09476532  0.0600665
10  0.1084029 -0.4934862 -0.584671816 -0.8096653  1.54466019 -1.8117459
11 -0.8152812  0.9494620  0.100909570  1.5944528  1.56724269  0.6839954
12  0.3130357  2.6245864  1.750448404 -0.7494403  1.06055267  1.0358267
13  1.1976817 -1.2110708  0.719397607 -0.2690107  0.83364274 -0.6895936
14 -2.1860098 -0.8488031 -0.302743475 -0.7348443  0.34302096 -0.8024803
15  0.2361756  0.6773727  1.279737692  0.8742478 -0.03064782 -0.4874172
16 -1.5634527 -0.8276335  0.753090683  2.0394865  0.79006103  0.5704210

我在做这样的事情

            X1          X2
1  -0.28358147 -0.40067128
2   0.50608365  1.27513471
3  -0.07950691 -0.22562957
4  -0.52542633  0.41103139
5   0.37758930 -0.46093340
6  -0.11784382  0.60514586
7   0.10811540  0.64293184
8   0.94388455  0.31948189
9   0.95197629 -0.10668118
10 -0.32325169 -0.35891702
11  0.07836345  1.28189698
12  1.56269017  0.44897971
13  0.23533617 -0.04165384
14 -1.11251880 -0.39810121
15  0.73109533  0.11872758
16 -0.54599850  1.13332286

我用这个做了,但显然对我更大的数据框没有好处......

which I did with this, but is obviously no good for my much larger data frame...

data.frame(cbind(
apply(cbind(dat$a, dat$b, dat$c), 1, mean),
apply(cbind(dat$d, dat$e, dat$f), 1, mean)
))

我已经尝试了 apply 和循环,但无法完全整合.我的实际数据有数百列.

I've tried apply and loops and can't quite get it together. My actual data has some hundreds of columns.

推荐答案

这可能更适用于您传递索引列表的情况.如果速度是一个问题(大数据帧),我会选择 lapplydo.call 而不是 sapply:

This may be more generalizable to your situation in that you pass a list of indices. If speed is an issue (large data frame) I'd opt for lapply with do.call rather than sapply:

x <- list(1:3, 4:6)
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))

如果您也只有列名也可以使用:

Works if you just have col names too:

x <- list(c('a','b','c'), c('d', 'e', 'f'))
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))

编辑

只是碰巧想也许您想自动执行此操作以每三列执行一次.我知道有更好的方法,但这里是一个 100 列的数据集:

Just happened to think maybe you want to automate this to do every three columns. I know there's a better way but here it is on a 100 column data set:

dat <- data.frame(matrix(rnorm(16*100), ncol=100))

n <- 1:ncol(dat)
ind <- matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=TRUE, ncol=3)
ind <- data.frame(t(na.omit(ind)))
do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))

编辑 2仍然对索引不满意.我认为有一种更好/更快的方法来传递索引.这是第二种虽然不令人满意的方法:

EDIT 2 Still not happy with the indexing. I think there's a better/faster way to pass the indexes. here's a second though not satisfying method:

n <- 1:ncol(dat)
ind <- data.frame(matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=F, nrow=3))
nonna <- sapply(ind, function(x) all(!is.na(x)))
ind <- ind[, nonna]

do.call(cbind, lapply(ind, function(i)rowMeans(dat[, i])))

这篇关于对列组应用函数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆