Create a sparse matrix, given the indices of non-zero elements, to build dummy variables for a categorical column of a large dataset


Problem description

I'm trying to use a sparse matrix to generate dummy variables for a set of data with 5.8 million rows and two categorical columns.

The structure of the data is:

mydata: a data.table of 5,800,000 rows and two categorical (integer-coded) variables, Var_1 and Var_2

nlevels(Var_1): 210,000 (levels include all numbers between 1 and 210,000)

nlevels(Var_2): 500 (levels include all numbers between 1 and 500)

Here's an example of mydata:

 Var_1      Var_2
   1          4
   1          2
   2          7
   5          9
   5          500
   .
   .
   .

  200         6
  200         2
  200         80
   .
   .
   .

I'm using a sparse matrix (sparse_Mx) to hold the dummy variables, with one row per level of Var_1 and one column per level of Var_2. It would be of the form:

Var_1       Var_2_level_1     Var_2_level_2   . . .    Var_2_level_500
  1                0                   1                    0
  2                0                   0                    0
  3                1                   1                    0
  4                0                   0                    0
  5                0                   0                    1

  .
  .
  .

 200              0                    1                    0
  .
  .
  .

210,000           ...                 ...                  ...

I didn't know how to do this efficiently, so I used a for-loop to create the dummy variable matrix:

library(Matrix) # for sparse matrices
sparse_Mx <- Matrix(0, nrow = 210000, ncol = 500, sparse = TRUE)

# set one cell per row of mydata: row index from Var_1, column index from Var_2
for (i in 1:nrow(mydata))
  sparse_Mx[mydata[i, Var_1], mydata[i, Var_2]] <- 1

It goes through each row of mydata and sets one cell of the sparse matrix to 1: the row's Var_1 value determines the row in the matrix, and its Var_2 value determines the column.

It works, except it's taking forever (the for-loop has to go through 5,800,000 iterations!).

Is there any way to do this more efficiently? I really dislike using a for-loop for this purpose but couldn't think of another way to do it.


Edit: I'd like to add that I have tried using sparse.model.matrix(), to no avail. The generated matrix does not have the right shape (210,000 rows and 500 columns).

I converted the variables to factors and used the following:

sp_mx <- sparse.model.matrix( ~ . -1 , data = mydata)

However, I get a sparse matrix of [5,800,000 x 500] as opposed to a matrix of [210,000 x 500].

I've tried many variations and still get the same result:

sp_mx <- sparse.model.matrix( ~ Var_2 -1 , data = mydata)

or

sp_mx <- sparse.model.matrix(Var_1 ~ Var_2 -1 , data = mydata)

All of them produce a sparse matrix with one row per observation. What I need is a [210,000 x 500] matrix in which a row can contain more than one 1.

Solution

Try this:

library(Matrix)
spmat <- Matrix(0, nrow = 210000, ncol = 500, sparse = TRUE)
# two-column (row, column) index matrix built from the two variables
locs <- cbind(mydata$Var_1, mydata$Var_2)
spmat[locs] <- 1
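
As a possible alternative (not part of the answer above), the Matrix package's sparseMatrix() constructor can build the matrix directly from the row and column indices, without first allocating an empty matrix and assigning into it. The sketch below uses a small made-up data.frame in place of mydata, and shrinks the dimensions to 5 x 10 so it runs instantly; both are assumptions for illustration only. Because sparseMatrix() adds up the x values of repeated (i, j) positions, repeated pairs are dropped first so that every entry stays 0 or 1.

```r
library(Matrix)

# Hypothetical toy data standing in for mydata; the pair (1, 4) is repeated on purpose
mydata <- data.frame(Var_1 = c(1L, 1L, 2L, 5L, 5L, 1L),
                     Var_2 = c(4L, 2L, 7L, 9L, 10L, 4L))

# Drop repeated (Var_1, Var_2) pairs, since sparseMatrix() sums x over duplicated positions
pairs <- unique(mydata)

# Row index from Var_1, column index from Var_2, value 1 in every listed cell.
# dims is 5 x 10 for the toy data; the question's case would use c(210000, 500).
spmat <- sparseMatrix(i = pairs$Var_1, j = pairs$Var_2, x = 1, dims = c(5, 10))
```

Passing dims explicitly keeps rows for levels of Var_1 that never occur in the data (rows 3 and 4 here), which matches the all-zero rows shown in the question's target layout.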
