Obtain unstandardized factor scores from factor analysis in R

Question

I'm conducting a factor analysis of several variables in R using factanal() (but am open to using other packages). I want to determine each case's factor score, but I want the factor scores to be unstandardized and on the original metric of the input variables. When I run the factor analysis and obtain the factor scores, they are standardized with a normal distribution of mean=0, SD=1, and are not on the original metric of the input variables. How can I obtain unstandardized factor scores that have the same metric as the input variables? Ideally, this would mean a similar mean, sd, range, and distribution.

I asked a similar question previously, but the respondent's answer involved rescaling standardized (i.e., normally distributed) factor scores. Note that I don't want to transform standardized factor scores to unstandardized ones because the distributions of my indicators are non-normal (i.e., the normal distribution of standardized factor scores cannot be easily transformed to the raw metric of my indicators). In other words, I'd like to estimate unstandardized factor scores on the raw metric of the indicators without first estimating them on a standardized metric.

Also, there are some missing data. How can I obtain (unstandardized) factor scores for all cases, even those who don't have data on all items?

Here is a small example:

library(psych)

v1 <- c(1,1,1,NA,1,1,1,1,1,1,3,3,3,3,3,4,5,6)
v2 <- c(1,2,1,1,1,1,2,1,2,1,3,4,3,3,3,4,6,5)
v3 <- c(3,3,3,3,3,1,1,1,1,1,1,1,1,1,1,5,4,6)
v4 <- c(3,3,4,3,3,1,1,2,NA,1,1,1,2,1,1,5,6,4)
v5 <- c(1,1,1,1,1,3,3,3,3,3,1,1,1,1,1,6,4,5)
v6 <- c(1,1,1,2,1,3,3,3,4,3,1,1,NA,2,1,6,5,4)
m1 <- cbind(v1,v2,v3,v4,v5,v6)

m1FactorScores <- factanal(~v1+v2+v3+v4+v5+v6, factors = 1, scores = "Bartlett", na.action="na.exclude")$scores

describe(m1) #means~2.3, sds~1.5
describe(m1FactorScores) #mean=0, sd=1

The data above are just a small example. My actual data are not likert/ordinal data. They are forecasts of football players' passing yards from various sources. My hope is that a "latent average" would more accurately forecast players' passing yards than an average because it would discard the unique biases of each source. The data are highly positively skewed, however, and forcing the latent variable and its factor scores to be normally distributed results in implausibly high values for many players (e.g., over 6,000 yards passing next season).

Recommended answer

The problem is: The answer to your previous question is still correct. Whether you fix the scale of the latent variable in advance or rescale the standardized variable is irrelevant, because the resulting scores will be the same.
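
To see why the two routes agree, here is a short sketch of the algebra for a one-factor model. The notation is introduced only for illustration and does not come from the original answer: $\nu$ for the indicator intercepts, $\lambda$ for the loadings, $\Theta$ for the residual covariance, and $a$, $b$ for the target mean and standard deviation of the factor. The score formula is the regression (posterior mean) predictor, which is an assumption about the kind of score being computed.

```latex
\begin{aligned}
\text{Standardized factor: } & x = \nu + \lambda\,\eta + \varepsilon, \quad \eta \sim N(0, 1), \quad \varepsilon \sim N(0, \Theta) \\
\text{Rescaled factor: } & \eta^{*} = a + b\,\eta, \quad b > 0 \;\Rightarrow\; \eta^{*} \sim N(a, b^{2}) \\
\text{Same model: } & x = \underbrace{\left(\nu - \tfrac{a}{b}\,\lambda\right)}_{\nu^{*}} + \underbrace{\tfrac{1}{b}\,\lambda}_{\lambda^{*}}\,\eta^{*} + \varepsilon \\
\text{Implied moments: } & \mathrm{E}[x] = \nu^{*} + \lambda^{*} a = \nu, \qquad \Sigma = b^{2}\,\lambda^{*}\lambda^{*\top} + \Theta = \lambda\lambda^{\top} + \Theta \\
\text{Scores: } & \hat{\eta} = \lambda^{\top}\Sigma^{-1}(x - \nu), \qquad \hat{\eta}^{*} = a + b^{2}\,\lambda^{*\top}\Sigma^{-1}(x - \nu) = a + b\,\hat{\eta}
\end{aligned}
```

Both parameterizations imply the same mean vector and covariance matrix for the indicators, so they fit equally well, and the scores differ only by the linear map $a + b\,\hat{\eta}$. With $a$ set to `mean(v1)` and $b$ set to `sd(v1)`, that map is the transformation `std.scores * sd(v1) + mean(v1)`.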

Here is an illustration using lavaan, including both options. Fixing the factor loadings and intercepts isn't supported in factanal as far as I know:

library(lavaan)

v1 <- c(1,1,1,2,1,1,1,1,1,1,3,3,3,3,3,4,5,6)
v2 <- c(1,2,1,1,1,1,2,1,2,1,3,4,3,3,3,4,6,5)
v3 <- c(3,3,3,3,3,1,1,1,1,1,1,1,1,1,1,5,4,6)
v4 <- c(3,3,4,3,3,1,1,2,NA,1,1,1,2,1,1,5,6,4)
v5 <- c(1,1,1,1,1,3,3,3,3,3,1,1,1,1,1,6,4,5)
v6 <- c(1,1,1,2,1,3,3,3,4,3,1,1,NA,2,1,6,5,4)
m1 <- data.frame(v1,v2,v3,v4,v5,v6)

# Option 1: fixing the scale according to v1
mean(v1) # 2.278
var(v1)  # 2.448

fix.model <- "f1 =~ v1 + v2 + v3 + v4 + v5 + v6
              f1 ~ 2.278*1
              f1 ~~ 2.448*f1"

fix.fit <- lavaan(fix.model, data = m1, meanstructure=TRUE, missing="fiml", 
                  int.ov.free = TRUE, int.lv.free = TRUE, auto.var = TRUE)

# Option 2: fixing the scale to standardize the latent variable
std.model <- "f1 =~ v1 + v2 + v3 + v4 + v5 + v6
              f1 ~ 0*1
              f1 ~~ 1*f1"

std.fit <- lavaan(std.model, data = m1, meanstructure=TRUE, missing="fiml", 
                  int.ov.free = TRUE, int.lv.free = TRUE, auto.var = TRUE)

# extract scores
fix.scores <- predict(fix.fit)[,1]
std.scores <- predict(std.fit)[,1]
rescaled <- std.scores * sd(v1) + mean(v1)

Notice the striking similarities between the fix.scores and the rescaled scores.

cbind(std.scores, rescaled, fix.scores)

#      std.scores  rescaled fix.scores
# [1,] -0.8220827 0.9916157  0.9917591
# [2,] -0.8113431 1.0084179  1.0085627
# [3,] -0.8098929 1.0106869  1.0108318
# [4,] -0.5844884 1.3633359  1.3635066

For the purposes of model fitting, the scale chosen for the latent variable is completely arbitrary. The distributional assumptions for the latent variable (i.e., normal) and the indicator variables (i.e., conditionally normal) are the same regardless of your choice and regardless of the actual distribution of your indicators.

If your indicators violate the distributional assumptions of the model, then this will be reflected in poor model fit or slow convergence, but not in the two approaches yielding different results.
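
As a rough check of this point, the following base-R sketch simulates positively skewed indicators and computes regression factor scores by hand under both scalings. It is an illustration under stated assumptions, not the lavaan workflow from the answer: the data are simulated, `factanal()` stands in for the model fit, and `reg_scores()` is a hypothetical helper written for this example.

```r
set.seed(123)

# Simulated, positively skewed data (hypothetical; not the question's data)
n   <- 300
eta <- rexp(n)                                    # skewed "true" factor
X   <- sapply(1:6, function(j) 1 + eta + rnorm(n, sd = 0.5))
colnames(X) <- paste0("v", 1:6)

# One-factor ML fit; factanal() works on the correlation matrix
fit    <- factanal(X, factors = 1)
lambda <- unclass(fit$loadings)[, 1]              # standardized loadings
Sigma  <- outer(lambda, lambda) + diag(fit$uniquenesses)  # model-implied correlations
Z      <- scale(X)                                # centred, unit-variance indicators

# Regression scores for a factor with mean alpha, variance psi, loadings lam.
# Sigma is unchanged by the rescaling because psi * lam lam' stays the same.
reg_scores <- function(Z, lam, alpha, psi, Sigma) {
  as.vector(alpha + psi * (Z %*% solve(Sigma, lam)))
}

a <- mean(X[, "v1"])   # target mean for the factor
b <- sd(X[, "v1"])     # target standard deviation for the factor

std.scores <- reg_scores(Z, lambda,     alpha = 0, psi = 1,   Sigma = Sigma)
fix.scores <- reg_scores(Z, lambda / b, alpha = a, psi = b^2, Sigma = Sigma)
rescaled   <- a + b * std.scores

# Difference between fixing the scale up front and rescaling afterwards
max(abs(fix.scores - rescaled))
```

Even though the simulated indicators are far from normal, the two sets of scores coincide up to floating-point error, because the skew affects how well the model fits and not the relationship between the two parameterizations.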
