How can I one-hot encode multiple variables with big data in R?
Problem description
I currently have a data frame with 260,000 rows and 50 columns, where 3 columns are numeric and the rest are categorical. I want to one-hot encode the categorical columns in order to perform PCA and use regression to predict the class. How can I accomplish the example below in R?
Example:
V1 V2 V3 V4 V5 .... VN-1 VN
to
V1_a V1_b V2_a V2_b V2_c V3_a V3_b and so on
Recommended answer
You can use model.matrix or sparse.model.matrix. Something like this:
library(Matrix)  # sparse.model.matrix lives in the Matrix package
sparse.model.matrix(~ . - 1, data = your_data)
The ~. tells R that your entire table (the .) is the right-hand side of some hypothetical model, and the -1 says to leave out the intercept. Without the -1, your first column will be a vector of 1s.
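One detail worth checking on your own data: with the default treatment contrasts, dropping the intercept gives a full set of indicator columns only for the first factor in the formula, and every later factor still loses its reference level. The sketch below uses a small made-up data frame (your_data, V1, V2, and V3 here are hypothetical stand-ins, not the real table) and shows one possible way to keep every level, by passing full contrast matrices through contrasts.arg.

```r
library(Matrix)  # provides sparse.model.matrix

# Hypothetical toy data standing in for the real 260,000 x 50 table
your_data <- data.frame(
  V1 = factor(c("a", "b", "a", "b")),
  V2 = factor(c("a", "b", "c", "a")),
  V3 = c(1.5, 2.0, 3.5, 4.0)
)

# Default behaviour: V1 keeps both levels, V2 drops its first level
encoded <- sparse.model.matrix(~ . - 1, data = your_data)
colnames(encoded)

# Identify factor columns and build full (non-reduced) contrast matrices
is_fac <- sapply(your_data, is.factor)
full_contrasts <- lapply(your_data[is_fac], contrasts, contrasts = FALSE)

# One indicator column per level of every factor; numeric columns pass through
full <- sparse.model.matrix(~ . - 1, data = your_data,
                            contrasts.arg = full_contrasts)
colnames(full)
```

Whether you want the full set depends on the next step: for regression, the reduced encoding avoids perfectly collinear columns, while for PCA or tree-based methods people often prefer one column per level.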