sparseMatrix with numerical and categorical data

Question

I am trying to create a sparse matrix from numerical and categorical data to use as the input to cv.glmnet. When only numerical data is involved, I can create a sparseMatrix using the following syntax:

sparseMatrix(i=c(1,3,5,2), j=c(1,1,1,2), x=c(1,2,4,3), dims=c(5,2))

For categorical variables, the following approach seems to work:

sparse.model.matrix(~-1+automobile, data.frame(automobile=c("sedan","suv","minivan","truck","sedan")))

My VERY sparse instance has 1,000,000 observations and 10,000 variables. I do not have enough memory to first create the full (dense) matrix. The only way I can think of to create the sparseMatrix is to handle the categorical variables manually, creating one column per level and converting the data into (i,j,x) format. I am hoping that somebody can suggest a better approach.
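For reference, the manual (i,j,x) route described above could look roughly like the sketch below. The data frame and its column names (`mileage`, `automobile`) are made up for illustration, and the code assumes the factor contains no missing values.

```r
library(Matrix)

# Hypothetical toy data: one numeric and one categorical column.
df <- data.frame(
  mileage    = c(0, 12.5, 0, 3.2, 0),
  automobile = factor(c("sedan", "suv", "minivan", "truck", "sedan"))
)

# Numeric column: store only the non-zero entries as (i, j, x) triplets.
nz <- which(df$mileage != 0)
num_block <- sparseMatrix(i = nz, j = rep(1L, length(nz)), x = df$mileage[nz],
                          dims = c(nrow(df), 1L),
                          dimnames = list(NULL, "mileage"))

# Categorical column: one indicator column per level. The integer code of
# each observation is the column index of the single 1 in its row.
# (Assumes no NA values in the factor.)
f <- df$automobile
cat_block <- sparseMatrix(i = seq_along(f), j = as.integer(f), x = 1,
                          dims = c(length(f), nlevels(f)),
                          dimnames = list(NULL, paste0("automobile", levels(f))))

# Combine the blocks column-wise; the result stays sparse.
X <- cbind(num_block, cat_block)
```

This keeps every level of the factor as its own column; whether a reference level should be dropped depends on how the model is meant to be parameterised.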

Recommended answer

This may or may not work at your scale, but you could try creating the sparse model matrix for each variable separately and then binding the pieces together column-wise.

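library(Matrix)
# cBind() is deprecated in newer Matrix releases; base cbind() handles sparse matrices since R 3.2.0
do.call(cbind,
        lapply(names(df), function(x) sparse.model.matrix(~., df[x])[, -1, drop=FALSE]))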
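As a quick check of the per-variable approach, here is a small sketch on a made-up data frame (the columns `mileage` and `automobile` are hypothetical). It builds one sparse block per variable, drops each block's intercept column, and binds the blocks together.

```r
library(Matrix)

# Hypothetical data frame mixing a numeric and a categorical predictor.
df <- data.frame(
  mileage    = c(0, 12.5, 0, 3.2, 0),
  automobile = factor(c("sedan", "suv", "minivan", "truck", "sedan"))
)

# One sparse model matrix per variable; the first column of each is the
# intercept, which is removed before binding.
blocks <- lapply(names(df), function(v) {
  sparse.model.matrix(~ ., df[v])[, -1, drop = FALSE]
})

# Column-bind all blocks into a single sparse design matrix.
X <- do.call(cbind, blocks)

dim(X)
colnames(X)
```

Because each block is built on its own, only one variable's model matrix needs to be held in memory at a time before binding, which is the point of the approach for wide data.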

Note that you probably want to create the intercept column and then remove it, rather than specifying -1 in the formula as you've done above. In a formula with several factors, -1 keeps all the levels of the first factor but still drops one level from each of the others, so the result depends on the ordering of the variables. Creating the intercept and then removing it drops one level from every factor, whatever the order.
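The effect of the intercept on factor coding can be checked on a toy example. The sketch below uses a made-up data frame with two factors, `a` and `b`, and prints the column names produced by each formula so the retained levels can be compared.

```r
library(Matrix)

# Hypothetical data: factor a has 3 levels, factor b has 2 levels.
df <- data.frame(
  a = factor(c("x", "y", "z", "x")),
  b = factor(c("p", "q", "p", "q"))
)

# No intercept, a listed first.
m_ab <- sparse.model.matrix(~ -1 + a + b, df)

# No intercept, b listed first.
m_ba <- sparse.model.matrix(~ -1 + b + a, df)

# Intercept created and then removed.
m_drop <- sparse.model.matrix(~ a + b, df)[, -1, drop = FALSE]

colnames(m_ab)
colnames(m_ba)
colnames(m_drop)
```

Comparing the three sets of column names shows which levels each variant keeps and whether swapping the order of `a` and `b` changes the encoding.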
