Sparse matrix slicing using list of int

Question

I'm writing a machine learning algorithm on huge, sparse data: my matrix has shape (347, 5 416 812 801) but is very sparse, with only 0.13% of the entries nonzero.

My sparse matrix's size is 105 000 bytes (< 1 MB) and it is of csr type.

I'm trying to separate train/test sets by choosing a list of example indices for each. So I want to split my dataset in two using:

training_set = matrix[train_indices]

of shape (len(training_indices), 5 416 812 801), still sparse

testing_set = matrix[test_indices]

of shape (347-len(training_indices), 5 416 812 801), also sparse

with training_indices and testing_indices being two lists of int.

But training_set = matrix[train_indices] seems to fail and returns a Segmentation fault (core dumped).

It might not be a problem of memory, as I'm running this code on a server with 64Gbytes of RAM.

Any clue on what could be the cause?

Recommended answer

I think I've recreated the csr row indexing with:

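import numpy as np
from scipy import sparse

def extractor(indices, N):
    # One row per selected index, with a single 1 in that index's column
    indptr = np.arange(len(indices) + 1)
    data = np.ones(len(indices))
    shape = (len(indices), N)
    return sparse.csr_matrix((data, indices, indptr), shape=shape)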

Testing on a csr I had hanging around:

In [185]: M
Out[185]: 
<30x40 sparse matrix of type '<class 'numpy.float64'>'
    with 76 stored elements in Compressed Sparse Row format>

In [186]: indices=np.r_[0:20]

In [187]: M[indices,:]
Out[187]: 
<20x40 sparse matrix of type '<class 'numpy.float64'>'
    with 57 stored elements in Compressed Sparse Row format>

In [188]: extractor(indices, M.shape[0])*M
Out[188]: 
<20x40 sparse matrix of type '<class 'numpy.float64'>'
    with 57 stored elements in Compressed Sparse Row format>

As with a number of other csr methods, it uses matrix multiplication to produce the final value, in this case with a sparse matrix that has a single 1 per row, in the column of each selected row index. The timing is actually a bit better.

In [189]: timeit M[indices,:]
1000 loops, best of 3: 515 µs per loop
In [190]: timeit extractor(indices, M.shape[0])*M
1000 loops, best of 3: 399 µs per loop
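
The comparison above can be reproduced with a minimal self-contained sketch. It assumes a small random CSR matrix as a stand-in for M; the 30x40 shape and the density are illustrative choices, not the original data.

```python
import numpy as np
from scipy import sparse

def extractor(indices, N):
    """Build a (len(indices), N) CSR matrix with a single 1 per row."""
    indptr = np.arange(len(indices) + 1)
    data = np.ones(len(indices))
    return sparse.csr_matrix((data, indices, indptr), shape=(len(indices), N))

# Illustrative stand-in for M; shape and density are assumptions.
M = sparse.csr_matrix(sparse.random(30, 40, density=0.06, random_state=0))
indices = np.r_[0:20]

by_indexing = M[indices, :]                      # ordinary row indexing
by_product = extractor(indices, M.shape[0]) * M  # selection via matrix product

# Both routes should select exactly the same rows.
same = np.array_equal(by_indexing.toarray(), by_product.toarray())
print(by_indexing.shape, by_product.shape, same)
```

Timings will differ by machine and SciPy version, so the sketch only checks that the two results agree.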

In your case the extractor matrix is (len(training_indices),347) in shape, with only len(training_indices) values. So it is not big.
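
To make the size claim concrete, here is a small sketch that builds extractor matrices for a train/test split of 347 rows. The split size of 250 and the random permutation are hypothetical; only the row count comes from the question.

```python
import numpy as np
from scipy import sparse

def extractor(indices, N):
    """Build a (len(indices), N) CSR matrix with a single 1 per row."""
    indptr = np.arange(len(indices) + 1)
    data = np.ones(len(indices))
    return sparse.csr_matrix((data, indices, indptr), shape=(len(indices), N))

n_rows = 347                      # number of examples, from the question
rng = np.random.default_rng(0)
perm = rng.permutation(n_rows)
train_indices = perm[:250]        # hypothetical split size
test_indices = perm[250:]

E_train = extractor(train_indices, n_rows)
E_test = extractor(test_indices, n_rows)

# Each extractor stores exactly one value per selected row.
print(E_train.shape, E_train.nnz)
print(E_test.shape, E_test.nnz)
```

Multiplying either extractor by the data matrix would then give the corresponding subset, provided the multiplication itself succeeds on a matrix of that width.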

But if the matrix is so large (or at least the 2nd dimension so big) that it produces some error in the matrix multiplication routines, it could give rise to segmentation fault without python/numpy trapping it.
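
One way to probe this hypothesis is to check whether the second dimension still fits in a 32-bit signed integer, since 5 416 812 801 is larger than 2**31 - 1. The sketch below is only a diagnostic under that assumption: the helper name `fits_int32` is hypothetical, and the original post does not confirm that index overflow is the actual cause of the crash.

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max  # largest 32-bit signed integer

def fits_int32(n):
    """Hypothetical helper: True if n can be stored as a 32-bit signed index."""
    return n <= INT32_MAX

n_rows = 347
n_cols = 5_416_812_801  # second dimension from the question

print(fits_int32(n_rows), fits_int32(n_cols))
```

On the real matrix, inspecting `matrix.indices.dtype` and `matrix.indptr.dtype` would show which integer width the index arrays actually use.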

Does matrix.sum(axis=1) work? That too uses a matrix multiplication, though with a dense matrix of 1s. Or sparse.eye(347)*M, a matrix multiplication of similar size?
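
The two checks suggested here can be sketched on a small stand-in matrix first, to see what a successful run looks like. The 1000-column width is an assumption chosen so the example runs anywhere; on the real data, `M` would be the original (347, 5 416 812 801) matrix.

```python
import numpy as np
from scipy import sparse

# Small illustrative stand-in; only the row count and density follow the question.
M = sparse.csr_matrix(sparse.random(347, 1000, density=0.0013, random_state=0))

row_sums = M.sum(axis=1)                 # dense (347, 1) result
identity_product = sparse.eye(347) * M   # sparse product, same shape as M

# Multiplying by the identity should leave the matrix unchanged.
unchanged = np.allclose(identity_product.toarray(), M.toarray())
print(row_sums.shape, identity_product.shape, unchanged)
```

If either call crashes on the real matrix in the same way as the indexing does, that would point at the multiplication routines rather than at the indexing code.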
