如何更有效地计算数据帧的多个子集的斜率? [英] How can I calculate the slope of multiple subsets of a data frame more efficiently?
问题描述
我有一个数据集,该数据集包含16个不同样本(bod 的波长( wl
)在一定波长范围内的光吸收( a
)>)在5个不同的天( day
)上.下面是示例( bod
)1-3的 dput
输出.
I have a data set that contains the optical absorption (a
) across a range of wavelengths (wl
) for 16 different samples (bod
) on 5 different days (day
). The dput
output for samples (bod
) 1 - 3 is below.
我需要每天每个样本的吸收率的自然对数的斜率与波长的关系.
I need the slope of the natural log of the absorption by wavelength for each sample on each day.
我目前的方法是手动执行此操作:
My current approach has been to do this manually:
# calculate and extract the slope for each sample and date
s275.1.0 <- lm(log(a) ~ wl, data = spec275, subset = bod == 1 & day == "2014-06-10")
s275.1.0.slope <- coef(s275.1.0)["wl"]
s275.2.0 <- lm(log(a) ~ wl, data = spec275, subset = bod == 2 & day == "2014-06-10")
s275.2.0.slope <- coef(s275.1.0)["wl"]
# etc...
# combine slopes into a vector
s275.slopes <- c(s275.1.0, s275.2.0) # etc...
显然,这很繁琐.是否有一种简单的方法可以简化此代码,以使R在所有样本和日期中迭代这些计算?
Obviously this is rather tedious. Is there a straightforward way to streamline this code to get R to iterate these calculations over all of the samples and days?
structure(list(bod = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
CPOM = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = c("no", "yes"), class = "factor"),
Nutrient = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("no", "yes"), class = "factor"),
day = structure(c(5L, 5L, 5L, 5L, 5L, 5L, 2L, 2L, 2L, 5L,
2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 5L, 5L,
5L, 2L, 3L, 5L, 2L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 3L, 4L, 4L,
4L, 4L, 4L, 4L, 4L), .Label = c("2014-06-10", "2014-06-12",
"2014-06-17", "2014-06-24", "2014-07-01"), class = "factor"),
wl = c(295L, 279L, 283L, 286L, 289L, 292L, 284L, 279L, 282L,
276L, 288L, 291L, 294L, 276L, 275L, 278L, 281L, 284L, 287L,
290L, 293L, 275L, 278L, 281L, 284L, 287L, 290L, 293L, 288L,
281L, 284L, 287L, 277L, 290L, 294L, 282L, 283L, 287L, 289L,
285L, 277L, 275L, 281L, 284L, 287L, 290L, 276L, 279L, 282L,
285L, 288L, 291L, 294L, 293L, 275L, 278L, 281L, 284L, 287L,
290L, 293L, 275L, 278L, 292L, 295L, 288L, 277L, 279L, 282L,
285L, 277L, 280L, 283L, 286L, 289L, 292L, 295L, 284L, 287L,
290L, 293L, 279L, 293L, 275L, 278L, 281L, 277L, 291L, 293L,
276L, 286L, 289L, 295L, 279L, 287L, 275L, 281L, 284L, 292L,
278L, 287L, 280L, 284L, 288L, 276L), abs = c(0.083, 0.105,
0.098, 0.093, 0.089, 0.086, 0.206, 0.221, 0.212, 0.109, 0.196,
0.19, 0.181, 0.436, 0.231, 0.224, 0.415, 0.397, 0.379, 0.361,
0.341, 0.124, 0.12, 0.114, 0.109, 0.105, 0.1, 0.096, 0.111,
0.124, 0.116, 0.113, 0.131, 0.108, 0.1, 0.123, 0.122, 0.115,
0.112, 0.118, 0.131, 0.136, 0.106, 0.1, 0.096, 0.093, 0.131,
0.128, 0.122, 0.117, 0.114, 0.11, 0.106, 0.086, 0.122, 0.118,
0.113, 0.109, 0.105, 0.103, 0.099, 0.114, 0.111, 0.099, 0.095,
0.103, 0.119, 0.116, 0.111, 0.107, 0.056, 0.055, 0.051, 0.048,
0.046, 0.043, 0.043, 0.144, 0.139, 0.135, 0.13, 0.153, 0.099,
0.097, 0.093, 0.088, 0.157, 0.102, 0.076, 0.159, 0.082, 0.078,
0.074, 0.119, 0.108, 0.122, 0.116, 0.11, 0.08, 0.097, 0.086,
0.094, 0.089, 0.084, 0.1), a = c(19.1149, 24.1815, 22.5694,
21.4179, 20.4967, 19.8058, 47.4418, 50.8963, 48.8236, 25.1027,
45.1388, 43.757, 41.6843, 100.4108, 53.1993, 51.5872, 95.5745,
91.4291, 87.2837, 83.1383, 78.5323, 28.5572, 27.636, 26.2542,
25.1027, 24.1815, 23.03, 22.1088, 25.5633, 28.5572, 26.7148,
26.0239, 30.1693, 24.8724, 23.03, 28.3269, 28.0966, 26.4845,
25.7936, 27.1754, 30.1693, 31.3208, 24.4118, 23.03, 22.1088,
21.4179, 30.1693, 29.4784, 28.0966, 26.9451, 26.2542, 25.333,
24.4118, 19.8058, 28.0966, 27.1754, 26.0239, 25.1027, 24.1815,
23.7209, 22.7997, 26.2542, 25.5633, 22.7997, 21.8785, 23.7209,
27.4057, 26.7148, 25.5633, 24.6421, 12.8968, 12.6665, 11.7453,
11.0544, 10.5938, 9.9029, 9.9029, 33.1632, 32.0117, 31.0905,
29.939, 35.2359, 22.7997, 22.3391, 21.4179, 20.2664, 36.1571,
23.4906, 17.5028, 36.6177, 18.8846, 17.9634, 17.0422, 27.4057,
24.8724, 28.0966, 26.7148, 25.333, 18.424, 22.3391, 19.8058,
21.6482, 20.4967, 19.3452, 23.03)), .Names = c("bod", "CPOM",
"Nutrient", "day", "wl", "abs", "a"), row.names = c(151L, 378L,
382L, 385L, 388L, 391L, 406L, 522L, 525L, 618L, 774L, 777L, 780L,
1105L, 1126L, 1129L, 1474L, 1477L, 1480L, 1483L, 1486L, 1656L,
1659L, 1662L, 1665L, 1668L, 1671L, 1674L, 2717L, 2832L, 2835L,
2838L, 2949L, 2963L, 3052L, 3143L, 3150L, 3169L, 3184L, 3194L,
3213L, 3293L, 3724L, 3727L, 3730L, 3733L, 4582L, 4585L, 4588L,
4591L, 4594L, 4597L, 4600L, 4829L, 5003L, 5006L, 5009L, 5012L,
5015L, 5018L, 5021L, 5541L, 5544L, 5807L, 5810L, 6127L, 6185L,
6200L, 6203L, 6206L, 6637L, 7006L, 7009L, 7012L, 7015L, 7018L,
7021L, 7157L, 7524L, 7527L, 7530L, 7881L, 7884L, 7975L, 7978L,
7981L, 8366L, 8369L, 8480L, 8486L, 8594L, 8597L, 8603L, 8844L,
8852L, 8961L, 8967L, 8970L, 9158L, 9268L, 9304L, 9310L, 9314L,
9345L, 9389L), class = "data.frame")
推荐答案
这是典型的拆分-应用-组合类型的操作:
This is a typical split-apply-combine type operation:
- 将数据拆分为相关的块,在这种情况下,先为
day
,然后为bod
- 对每个数据块应用一个函数;在这种情况下,线性模型提取斜率系数,
- 将斜率参数和聚合数据组合到一个汇总输出对象(通常是一个数据帧)中.
有很多方法可以做到这一点,特别是说使用 plyr 和最近使用的 dplyr 软件包,但我仍然觉得基本的R是手动版本有启发性的.
There are many ways to do this, especially say using the plyr and more recently the dplyr packages, but I still feel a base R, by-hand version is instructive.
首先,将数据拆分.在这里,我使用 split()
和 spl
最终是一个数据帧列表,每个组合的 day
和 bod 代码>:
First, split the data up. Here I use split()
and spl
ends up being a list of data frames, one per combination of day
and bod
:
spl <- with(spec275, split(spec275, list(day = day, bod = bod)))
现在,我们希望包装函数适合 lm()
模型并返回斜率系数.该包装器执行以下操作:
Now we want a wrapper function to fit the lm()
model and return the slope coefficient. This wrapper does that:
coefLM <- function(x) {
coef(lm(log(a) ~ wl, data = x))[2]
}
在这种情况下,我们通过 sapply()
进行应用,但是 vapply()
会更安全,它是 spl
.
We apply, in this case via sapply()
, but vapply()
would be safer, the wrapper function to each component of spl
.
coefs <- sapply(spl, coefLM)
我们得到一个斜率向量:
We get back a vector of slopes:
head(coefs)
> head(coefs)
2014-06-10.1.wl 2014-06-12.1.wl 2014-06-17.1.wl 2014-06-24.1.wl 2014-07-01.1.wl
-0.01456420 -0.01285560 -0.01554979 -0.01446049 -0.01478953
2014-06-10.2.wl
-0.01545840
然后需要使用 day
和 bod
的唯一组合将此向量放回到对象中.
It is then required to put this vector back into an object with the unique combinations of day
and bod
.
如果您希望同时进行所有操作,请执行以下操作:
If you want all this together, do:
spl <- with(spec275, split(spec275, list(day = day, bod = bod)))
out <- unique(spec275[, c("day", "bod")])
out <- transform(out, slope = sapply(spl, coefLM))
> out
day bod slope
151 2014-07-01 1 -0.01456420
406 2014-06-12 1 -0.01285560
1105 2014-06-10 1 -0.01554979
1656 2014-06-24 1 -0.01446049
2717 2014-06-17 1 -0.01478953
3143 2014-06-12 2 -0.01545840
3724 2014-06-10 2 -0.01359000
4582 2014-06-17 2 -0.01197854
5003 2014-06-24 2 -0.01157193
5807 2014-07-01 2 -0.01237455
6637 2014-06-10 3 -0.01652268
7157 2014-06-12 3 -0.01177575
7884 2014-06-17 3 -0.01194247
7975 2014-07-01 3 -0.01388849
9158 2014-06-24 3 -0.01365501
这篇关于如何更有效地计算数据帧的多个子集的斜率?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!