Fit model to multiple groupings or subsets and extract original factor columns for data frame output


Problem description

I want to fit models and pull out specific parameters split by grouping factors (fac1 and fac2 below) or subsets. My problem is that when sapply outputs the correct parameters, I'm stuck with a list where the elements are named as combinations. What I want to get is a data.frame where I have a column for each factor with the appropriate label. I want to do this in base R.

Note that the answer needs to be general and not tied to the specific names used in this case. It should not break if factor names include periods. I'm eventually making something to use with any data, so the answer needs to work with any data and with any number of factors. I am actually using a custom function on a much larger data set, but this example represents my issue.

Here is reproducible code:

#create data
fac1 <- c(rep("A", 10), rep("B",10))
fac2 <- rep(c(rep("X", 5), rep("Y",5)),2)
x <- rep(1:5,4)
set.seed(1337)
y <- rep(seq(2, 10, 2), 4) * runif(20, .8, 1.2)

xy <- data.frame(x,y) #bind parameters for regression

factors <- list(fac1, fac2) #split by 2 factors

sapply(split(xy, factors), function(c) coef(lm(c$y~c$x))[2]) 
#run regression by these 4 groups, pull out slope

The output is:

A.X.c$x  B.X.c$x  A.Y.c$x  B.Y.c$x 
1.861290 2.131431 1.590733 1.746169

What I want is:

fac1 fac2 slope
A    X    1.861290 
B    X    2.131431 
A    Y    1.590733 
B    Y    1.746169

The following code might be made more general to accomplish this, but I'm worried about cases where expand.grid makes all possible combinations while the user's data is missing some of them, and also about whether the order will stay the same. Does expand.grid order its combinations the same way split orders its subsets, which is what determines the order of the returned values?

slopes <- sapply(split(xy, factors), function(c) coef(lm(c$y~c$x))[2]) 

dataframeplz <- as.data.frame(expand.grid(unique(fac1), unique(fac2))) 

dataframeplz$slope <- slopes

dataframeplz
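
To check the ordering question, here is a small sketch on made-up data (the names f1, f2, d, groups and grid are illustrative only, and the B/Y combination is deliberately left out). As far as I can tell, split builds its groups from interaction(), which varies the first factor fastest over the factor levels, the same convention expand.grid uses, and with the default drop = FALSE the empty combination is kept as a zero-row data frame.

```r
# Hypothetical data with one missing combination (no rows for B/Y)
f1 <- factor(c("A", "A", "B", "A"))
f2 <- factor(c("X", "Y", "X", "X"))
d  <- data.frame(v = 1:4)

# split() keeps empty combinations because drop = FALSE is the default
groups <- split(d, list(f1, f2))

# expand.grid() over the factor levels, first factor varying fastest
grid <- expand.grid(f1 = levels(f1), f2 = levels(f2))

# Compare the group names with the grid rows pasted together
same_order <- identical(names(groups), paste(grid$f1, grid$f2, sep = "."))
group_sizes <- sapply(groups, nrow)

print(same_order)
print(group_sizes)
```

Note that the sketch passes levels() rather than unique() to expand.grid: unique() returns values in order of first appearance, while split works from the sorted factor levels, so the two only line up when the data happen to appear in level order. The empty group also means a model fitted inside sapply has to cope with zero rows, or split has to be called with drop = TRUE.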

Here is the plyr solution if that helps. It's so easy but not base R. Anyone know where in Hadley's code this magic happens? Githubbers?

library("plyr")
neatdata <- data.frame(fac1,fac2,x,y)
ddply(neatdata, c("fac1", "fac2"), function(c) coef(lm(c$y~c$x))[2])

Recommended answer

For base R, aggregate is the user-friendly function for such situations.

aggregate(cbind(slope=1:nrow(xy))~fac1+fac2,FUN=function(r) coef(lm(y~x,data=xy[r,]))[2])


  fac1 fac2    slope
1    A    X 1.861290
2    B    X 2.131431
3    A    Y 1.590733
4    B    Y 1.746169
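
Since the question asks for something general, the same idea can be wrapped in a function that takes a named list of grouping vectors. This is only a sketch: slope_by, d, f1 and f2 are hypothetical names, and the model y ~ x is assumed to match the columns of the data. It uses the non-formula interface of aggregate, so the factor columns are labelled from the list names and nothing is parsed from pasted group names (periods in levels such as "A.1" should therefore be harmless).

```r
# Hypothetical helper: slope of y ~ x within each combination of grouping factors
slope_by <- function(data, by) {
  # by: named list of grouping vectors, one element per factor
  aggregate(
    data.frame(slope = seq_len(nrow(data))),  # aggregate over row indices
    by = by,
    FUN = function(rows) unname(coef(lm(y ~ x, data = data[rows, ]))[2])
  )
}

# Made-up data: exact slope of 2 in every group, no rows for B/Y
f1 <- c("A.1", "A.1", "A.1", "B", "B", "B", "A.1", "A.1", "A.1")
f2 <- c("X", "X", "X", "X", "X", "X", "Y", "Y", "Y")
d  <- data.frame(x = rep(1:3, 3), y = rep(c(2, 4, 6), 3))

res <- slope_by(d, list(f1 = f1, f2 = f2))
print(res)
```

Because aggregate only reports combinations that occur in the data, the missing B/Y group simply has no row in the result.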

This could also be done with by in a fashion a bit more similar to your original.

setNames(as.data.frame.table(
  by(xy,list(fac1,fac2),FUN=function(c) coef(lm(c$y~c$x))[2])),
  c("fac1","fac2","slope"))
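
If some combinations are absent from the data, the by version still returns a cell for them. A minimal sketch on made-up data (f1, f2 and d are hypothetical, with no B/Y rows) suggests filtering those rows out afterwards. Naming the list elements also lets as.data.frame.table label the factor columns, so setNames is not needed.

```r
# Made-up data: exact slope of 2 in every group, no rows for B/Y
f1 <- c("A", "A", "A", "B", "B", "B", "A", "A", "A")
f2 <- c("X", "X", "X", "X", "X", "X", "Y", "Y", "Y")
d  <- data.frame(x = rep(1:3, 3), y = rep(c(2, 4, 6), 3))

# Named list elements become the dimension names of the by object
fits <- by(d, list(f1 = f1, f2 = f2),
           function(g) unname(coef(lm(y ~ x, data = g))[2]))

# One row per cell of the f1 x f2 table; the empty cell comes back as NA
out <- as.data.frame.table(fits, responseName = "slope")

# Keep only the combinations that actually occur in the data
out_present <- out[!is.na(out$slope), ]
print(out_present)
```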
