Fit model to multiple groupings or subsets and extract original factor columns for data frame output


Problem description


I want to fit models and pull out specific parameters, split by grouping factors (fac1 and fac2 below) or subsets. My problem is that although sapply outputs the correct parameters, I'm stuck with a named vector whose elements are named after the factor combinations. What I want is a data.frame with a column for each factor, holding the appropriate label. I want to do this in base R.

Note that the answer needs to be general, not tied to the specific names used in this case. It should still work if factor names include periods. I'm eventually making something to use with any data, so the answer needs to handle that, and any number of factors. I am actually using a custom function on a much larger data set, but this example represents my issue.

Following is reproducible code:

#create data
fac1 <- c(rep("A", 10), rep("B",10))
fac2 <- rep(c(rep("X", 5), rep("Y",5)),2)
x <- rep(1:5,4)
set.seed(1337)
y <- rep(seq(2, 10, 2), 4) * runif(20, .8, 1.2)

xy <- data.frame(x,y) #bind parameters for regression

factors <- list(fac1, fac2) #split by 2 factors

sapply(split(xy, factors), function(c) coef(lm(c$y~c$x))[2]) 
#run regression by these 4 groups, pull out slope

The output is:

A.X.c$x  B.X.c$x  A.Y.c$x  B.Y.c$x 
1.861290 2.131431 1.590733 1.746169

What I want is:

fac1 fac2 slope
A    X    1.861290 
B    X    2.131431 
A    Y    1.590733 
B    Y    1.746169

The following code might be made more general to accomplish this, but I'm worried about cases where expand.grid makes all possible combinations while the user's data is missing some of them, and also about whether the order will stay the same. Does expand.grid order its combinations the same way split orders its subsets, so that the returned values line up with the grid rows?

slopes <- sapply(split(xy, factors), function(c) coef(lm(c$y~c$x))[2]) 

dataframeplz <- as.data.frame(expand.grid(unique(fac1), unique(fac2))) 

dataframeplz$slope <- slopes

dataframeplz
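The ordering concern can be checked directly. The following is a small sketch on toy factors invented for this check (not the data above), where one combination never occurs and the first factor is not already sorted. It compares the group names that split produces with the rows that expand.grid builds from unique().

```r
# Toy factors invented for this check; the B/Y combination never occurs.
f1 <- c("B", "A", "A", "B", "A")
f2 <- c("X", "X", "Y", "X", "Y")

# split() names groups from the sorted factor levels, first factor varying fastest.
split_all  <- names(split(seq_along(f1), list(f1, f2)))               # keeps the empty B.Y group
split_seen <- names(split(seq_along(f1), list(f1, f2), drop = TRUE))  # omits B.Y

# expand.grid() also varies its first argument fastest, but unique() keeps
# order of appearance, so the rows can come out in a different order.
grid <- expand.grid(fac1 = unique(f1), fac2 = unique(f2))
grid_names <- paste(grid$fac1, grid$fac2, sep = ".")

print(split_all)
print(split_seen)
print(grid_names)
```

In this toy case the grid starts with B.X while split starts with A.X, and without drop = TRUE split also returns an empty B.Y group that lm could not be fitted to. Building the grid from sort(unique(...)) would align the order, but it would still list combinations that are absent from the data.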

Here is the plyr solution if that helps. It's so easy but not base R. Anyone know where in Hadley's code this magic happens? Githubbers?

library("plyr")
neatdata <- data.frame(fac1,fac2,x,y)
ddply(neatdata, c("fac1", "fac2"), function(c) coef(lm(c$y~c$x))[2])

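Solution

For base R, aggregate is the user-friendly function for such situations. The trick is to aggregate the row indices, so that FUN receives each group's indices and uses them to subset xy.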

aggregate(cbind(slope=1:nrow(xy))~fac1+fac2,FUN=function(r) coef(lm(y~x,data=xy[r,]))[2])

  fac1 fac2    slope
1    A    X 1.861290
2    B    X 2.131431
3    A    Y 1.590733
4    B    Y 1.746169
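The same row-index idea can be written so that the grouping columns are passed by name, which may help when the number of factors is not known in advance. This is only a sketch: slope_by and group_cols are names made up here, and it assumes the data frame has columns called x and y.

```r
# Data as in the question, collected into one data frame.
fac1 <- c(rep("A", 10), rep("B", 10))
fac2 <- rep(c(rep("X", 5), rep("Y", 5)), 2)
x <- rep(1:5, 4)
set.seed(1337)
y <- rep(seq(2, 10, 2), 4) * runif(20, .8, 1.2)
neatdata <- data.frame(fac1, fac2, x, y)

# Hypothetical wrapper (not from the original answer): group_cols is a
# character vector naming any number of grouping columns in `data`.
slope_by <- function(data, group_cols) {
  aggregate(
    data.frame(slope = seq_len(nrow(data))),  # row indices stand in for the response
    by = data[group_cols],                    # grouping columns keep their names
    FUN = function(r) coef(lm(y ~ x, data = data[r, ]))[[2]]  # slope for rows r
  )
}

slope_by(neatdata, c("fac1", "fac2"))
```

Because the grouping columns are handed to aggregate as columns rather than parsed back out of pasted names, periods in factor labels should not matter, and only combinations that occur in the data get a row.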

This could also be done with by in a fashion a bit more similar to your original.

setNames(as.data.frame.table(
  by(xy,list(fac1,fac2),FUN=function(c) coef(lm(c$y~c$x))[2])),
  c("fac1","fac2","slope"))
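For the general case asked about (any number of factors, labels that may contain periods, combinations that may be missing), one possible base R sketch is to split row indices and read each group's labels back from the data instead of from the pasted names. Everything named below (fit_by_groups, group_cols, fit_fun, value_name) is hypothetical and introduced only for this illustration; it assumes fit_fun returns a single number per group.

```r
# Hypothetical helper (not from the original answer): apply fit_fun to every
# observed combination of the grouping columns and return one row per group.
fit_by_groups <- function(data, group_cols, fit_fun, value_name = "value") {
  # Integer codes per grouping column, so labels containing "." cannot collide
  # when interaction() pastes them together.
  codes <- lapply(data[group_cols], function(f) as.integer(as.factor(f)))
  groups <- interaction(codes, drop = TRUE)   # drop = TRUE skips unobserved combinations
  idx <- split(seq_len(nrow(data)), groups)   # row indices for each group
  first_rows <- vapply(idx, function(i) i[1L], integer(1))
  out <- data[first_rows, group_cols, drop = FALSE]  # labels taken from the data itself
  out[[value_name]] <- unname(
    vapply(idx, function(i) fit_fun(data[i, , drop = FALSE]), numeric(1))
  )
  rownames(out) <- NULL
  out
}

# Usage with the data from the question.
fac1 <- c(rep("A", 10), rep("B", 10))
fac2 <- rep(c(rep("X", 5), rep("Y", 5)), 2)
x <- rep(1:5, 4)
set.seed(1337)
y <- rep(seq(2, 10, 2), 4) * runif(20, .8, 1.2)
neatdata <- data.frame(fac1, fac2, x, y)

fit_by_groups(neatdata, c("fac1", "fac2"),
              function(d) coef(lm(y ~ x, data = d))[[2]], "slope")
```

The rows come out in the same order as split, with the first grouping column varying fastest, and fit_fun can be swapped for any custom function that takes a data frame subset.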
