How to programmatically determine the column indices of principal components using the FactoMineR package?


Question

Given a data frame containing mixed variables (i.e. both categorical and continuous), for example:

digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}

df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
                 studLoc=sample(createRandString(10)),
                 finalmark=sample(c(0:100),10),
                 subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
                 )

I use the FactoMineR package:

df.princomp <- FactoMineR::FAMD(df, graph = FALSE)

The variable df.princomp is a list.

Thereafter, to visualize the principal components, I use fviz_screeplot() and fviz_contrib():

#library(factoextra)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
                           barfill = "gray", barcolor = "black",
                           ylim = c(0, 50), xlab = "Principal Component", 
                           ylab = "Percentage of explained variance",
                           main = "Principal Component (PC) for mixed variables")

factoextra::fviz_contrib(df.princomp, choice = "var", 
                         axes = 1, top = 10, sort.val = c("desc"))

This produces Fig1 (the scree plot) and Fig2 (the contribution plot).

Explanation of Fig1: Fig1 is a scree plot. A scree plot is a simple line-segment plot that shows the fraction of the total variance in the data explained by each principal component (PC). Here the first three PCs together account for 43.8% of the total variance. The natural next question is which variables drive these components; that is what Fig2 shows.

Explanation of Fig2: This figure visualizes the contributions of rows/columns taken from the results of the principal component analysis (PCA). From it I can see that the variables name, studLoc and finalmark are the most important ones and can be used for further analysis.

Further analysis (where I am stuck): To extract the contributions of the aforementioned variables name, studLoc and finalmark, I index into the result object df.princomp (see above), e.g. df.princomp$quanti.var$contrib[,4] and df.princomp$quali.var$contrib[,2:3].

I have to specify the column indices [,2:3] and [,4] manually.

What I want: I want to know how to assign the column indices dynamically, so that I do not have to hard-code indices such as [,2:3] when indexing into the list df.princomp.
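One way to avoid hard-coded positions is to index by name and, where numeric indices are still required, derive them from the names. The sketch below uses a small made-up matrix as a stand-in for a contribution matrix; it assumes rows are variables, columns are dimensions, and both carry names, which is worth confirming on the real object with dimnames(df.princomp$quanti.var$contrib).

```r
# Illustrative stand-in for a contribution matrix (values are made up):
# rows are variables, columns are dimensions.
contrib <- matrix(
  c(30, 10,  5,
    25, 40, 15,
    45, 50, 80),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("name", "studLoc", "finalmark"),
                  c("Dim.1", "Dim.2", "Dim.3"))
)

# Select by name instead of by position
wanted_dims <- c("Dim.2", "Dim.3")
contrib[, wanted_dims, drop = FALSE]

# If numeric indices are still needed, derive them from the names
dim_idx <- match(wanted_dims, colnames(contrib))
var_idx <- match("finalmark", rownames(contrib))

contrib[var_idx, dim_idx]
```

Because the lookup goes through the names, the code keeps working if the number or order of columns changes.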

I have already looked at the similar questions 1, 2, 3 and 4, but could not find a solution. Any help or suggestions would be appreciated.

Recommended answer

I am not sure my interpretation of your question is correct, so apologies if not. From what I gather, you are using PCA as an initial tool to show which variables are the most important in explaining the dataset. You then want to go back to your original data, select these variables quickly without coding the selection by hand each time, and use them for some other analysis.

If this is correct, then what I have done is save the data behind the contribution plot, keep only the variables with the greatest contribution, and use that result to create a new data frame containing these variables alone.

digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}

df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
                 studLoc=sample(createRandString(10)),
                 finalmark=sample(c(0:100),10),
                 subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)

df.princomp <- FactoMineR::FAMD(df, graph = FALSE)

factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
                           barfill = "gray", barcolor = "black",
                           ylim = c(0, 50), xlab = "Principal Component", 
                           ylab = "Percentage of explained variance",
                           main = "Principal Component (PC) for mixed variables")

# Find the top contributing variables for the chosen dimension(s).
# top = 10 asks for up to ten variables, although df has only 6.
# axes = selects the dimension(s) to look at, e.g. axes = c(1, 2).
f <- factoextra::fviz_contrib(df.princomp, choice = "var",
                              axes = c(1), top = 10, sort.val = c("desc"))

# save the data behind the contribution plot
dat <- f$data

# keep the variables whose contribution is higher than, say, 20
r <- rownames(dat[dat$contrib > 20, ])

# extract these from the original data frame into a new data frame for further analysis
new <- df[r]

new

#finalmark name    studLoc
#1         53    b POTYQ0002N
#2         73    i LWMTW1195I
#3         95    d VTUGO1685F
#4         39    f YCGGS5755N
#5         97    c GOSWE3283C
#6         58    g APBQD6181U
#7         67    a VUJOG1460V
#8         64    h YXOGP1897F
#9         15    j NFUOB6042V
#10        81    e QYTHG0783G
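The cutoff of 20 above is itself hand-picked. A possible alternative, sketched here with made-up values, is to derive the cutoff from the data as the contribution each variable would have if all contributed equally (100 divided by the number of variables). For a single dimension this should correspond to the dashed reference line that fviz_contrib draws, though that is worth checking against your own plot. The vector contrib is a hypothetical stand-in for the contrib column of f$data.

```r
# Made-up contributions (in percent) of six variables to one dimension
contrib <- c(finalmark = 28, name = 24, studLoc = 22,
             subj1mark = 12, subj2mark = 9, ID = 5)

# Reference value: 100% shared equally among all variables
cutoff <- 100 / length(contrib)

# Names of the variables that contribute more than the reference value
keep <- names(contrib)[contrib > cutoff]
keep
```

With the real objects, keep could then be used as df[keep], in the same way as r above.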

Based on your comment, where you said you wanted to 'Find variables with value greater than 5 in Dim.1 AND Dim.2 and save these variables to a new data frame', I would do this:

#top contributors to both Dim 1 and 2

f<-factoextra::fviz_contrib(df.princomp, choice = "var", 
                         axes = c(1,2), top = 10, sort.val = c("desc"))

#save data from contribution plot
dat<-f$data

#keep the variables whose contribution is higher than 5

r<-rownames(dat[dat$contrib>5,])

#extract these from your original data frame into a new data frame for further analysis

new<-df[r]

new

(This keeps all the original variables in the new data frame, since each of them contributes more than 5% to Dim.1 and Dim.2 combined.)
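With axes = c(1, 2), fviz_contrib reports a single combined contribution per variable rather than testing each dimension separately. If the "AND" in the comment is meant literally, a per-dimension check is one option. The sketch below uses a made-up matrix with variables in rows and dimensions in columns; with the FAMD result, df.princomp$var$contrib is assumed to have that layout, which should be verified with names(df.princomp) and dimnames() before relying on it.

```r
# Illustrative contribution matrix (values are made up)
contrib <- matrix(
  c(40,  3,
    30, 20,
     4, 50,
    26, 27),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("name", "studLoc", "finalmark", "subj1mark"),
                  c("Dim.1", "Dim.2"))
)

dims <- c("Dim.1", "Dim.2")
threshold <- 5

# TRUE where a variable exceeds the threshold on a given dimension
above <- contrib[, dims, drop = FALSE] > threshold

# Keep only the variables that exceed the threshold on every dimension
keep <- rownames(contrib)[rowSums(above) == length(dims)]
keep
```

Here name and finalmark are dropped because each falls below the threshold on one of the two dimensions. The resulting names can be passed to df[keep] as before.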
