Implications of manually setting scipy sparse matrix shape


Question

I need to perform online training on a TF-IDF model. I found that scikit-learn's TfidfVectorizer does not support online training, so I'm implementing my own CountVectorizer to support it, and then using scikit-learn's TfidfTransformer to update the tf-idf values after a predefined number of documents have entered the corpus.

I found here that you shouldn't be adding rows or columns to numpy arrays since all data would need to be copied so it is stored in contiguous blocks of memory.

But then I also found that, with a scipy sparse matrix, you can in fact change the matrix's shape manually.

The NumPy reshape docs say:

It is not always possible to change the shape of an array without copying the data. If you want an error to be raised when the data is copied, you should assign the new shape to the shape attribute of the array

Since the "reshaping" of the sparse matrix is being done by assigning a new shape, is it safe to say data is not being copied? What are the implications of doing so? Is it efficient?

Code example:

from scipy import sparse

matrix = sparse.random(5, 5, density=.2, format='csr')  # Create (5,5) sparse matrix
matrix._shape = (6, 6)  # Change shape to (6, 6) through the private attribute
# Modify data on new empty row

I would also like to expand my question to ask about methods such as vstack, which allow one to append arrays to one another (the same as adding a row). Does vstack copy all the data so that it gets stored in contiguous blocks of memory, as stated in my first link? What about hstack?


EDIT: So, following this question I've implemented a method to alter the values of a row in a sparse matrix.

Now, mixing the idea of adding new empty rows with the idea of modifying existing values I've come up with the following:

import numpy as np
from scipy import sparse

matrix = sparse.random(5, 3, density=.2, format='csr')
matrix._shape = (6, 3)
# Update indptr to let it know we added a row with nothing in it.
matrix.indptr = np.hstack((matrix.indptr, matrix.indptr[-1]))

# New elements on data, indices format
new_elements = [1, 1]
elements_indices = [0, 2]

# Set elements for new empty row
# (set_row_csr_unbounded is the helper from the linked question; it is not defined here)
set_row_csr_unbounded(matrix, 5, new_elements, elements_indices)

I ran the above code a few times during the same execution and got no error. But as soon as I try to add a new column (in which case there is no need to change indptr), I get an error when I try to alter the values. Any lead on why this happens?

Well, since set_row_csr_unbounded uses numpy.r_ underneath, I assume I'm better off using a lil_matrix, even if the elements cannot be modified once they have all been added. Am I right?

I think that lil_matrix would be better because I assume numpy.r_ is copying the data.

Answer

In numpy, reshape means changing the shape in a way that keeps the same number of elements. So the product of the shape terms can't change.

The simplest example is something like

np.arange(12).reshape(3,4)

The assignment method is:

x = np.arange(12)
x.shape = (3,4)

The method (or np.reshape(...)) returns a new array. The shape assignment works in-place.

The docs note that you quote comes into play when doing something like

x = np.arange(12).reshape(3,4).T
x.reshape(3,4)   # ok, but copy
x.shape = (3,4)  # raises error

To better understand what's happening here, print the array at different stages, and look at how the original 0,1,2,... contiguity changes. (That's left as an exercise for the reader, since it isn't central to the bigger question.)
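The difference between the two calls can be checked directly. This is a minimal sketch; the variable names are only for illustration, and it relies on np.shares_memory to tell a view from a copy.

```python
import numpy as np

base = np.arange(12)
x = base.reshape(3, 4)      # a view on base, nothing is copied
xt = x.T                    # transposed view, shape (4, 3), no longer C-contiguous

view_shares = np.shares_memory(base, x)
print(view_shares)
print(xt.flags["C_CONTIGUOUS"])

y = xt.reshape(3, 4)        # reshape has to copy to produce this layout
copy_shares = np.shares_memory(base, y)
print(copy_shares)

# In-place shape assignment refuses to copy and raises instead.
assignment_failed = False
try:
    xt.shape = (3, 4)
except AttributeError as err:
    assignment_failed = True
    print("shape assignment refused:", err)
```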

There is a resize function and method, but it isn't used much, and its behavior with respect to views and copies is tricky.

np.concatenate (and variants like np.stack, np.vstack) make new arrays, and copy all the data from the inputs.
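A small check of that claim, as a sketch with made-up arrays: the stacked result does not share memory with its inputs, so later changes to an input do not show up in it.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
new_row = np.zeros((1, 3), dtype=a.dtype)

stacked = np.vstack((a, new_row))   # allocates a fresh (3, 3) array
print(stacked.shape)
print(np.shares_memory(a, stacked))

a[0, 0] = 99                        # modify the input after stacking
print(stacked[0, 0])                # the stacked copy still holds the old value
```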

A list (and an object dtype array) contains pointers to its elements (which may be arrays), and so doesn't require copying data.

Sparse matrices store their data (and row/col indices) in various attributes that differ among the formats. coo, csr and csc have 3 1d arrays. lil has 2 object arrays containing lists. dok is a dictionary subclass.
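To see those attributes side by side, here is a sketch that converts one small dense array (chosen arbitrarily for illustration) into each format and prints what it stores.

```python
import numpy as np
from scipy import sparse

dense = np.array([[0, 6, 7],
                  [0, 0, 6],
                  [1, 0, 5]])

coo = sparse.coo_matrix(dense)
csr = coo.tocsr()
lil = coo.tolil()
dok = coo.todok()

# coo: three parallel 1d arrays
print(coo.data, coo.row, coo.col)
# csr: values, column indices, and row pointers
print(csr.data, csr.indices, csr.indptr)
# lil: two object arrays, each holding one Python list per row
print(lil.rows, lil.data)
# dok: dictionary-style storage keyed by (row, col)
print(sorted(dok.keys()))
```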

At the time of writing, lil_matrix implements a reshape method and the other formats do not. As with np.reshape, the product of the dimensions can't change.

In theory a sparse matrix could be 'embedded' in a larger matrix with minimal copying of data, since all the new values will be the default 0, and not occupy any space. But the details for that operation have not been worked out for any of the formats.
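Newer SciPy releases appear to have filled in part of this gap: the sparse formats have a resize method that changes the shape in place. The sketch below assumes such a release is installed (1.1 or later, as far as I know); on older versions the call does not exist. The example matrix is arbitrary.

```python
import numpy as np
from scipy import sparse

dense = np.array([[0, 6, 7],
                  [0, 0, 6],
                  [1, 0, 5]])
M = sparse.csr_matrix(dense)

# Grow in place: one extra row and two extra columns, all implicit zeros.
M.resize((4, 5))

print(M.shape)
print(M.nnz)          # the stored elements are unchanged
print(M.toarray())
```

This avoids touching the private _shape attribute and keeps indptr consistent with the new number of rows.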

sparse.hstack and sparse.vstack (don't use the numpy versions on sparse matrices) work by combining the coo attributes of the inputs (via sparse.bmat). So yes, they make new arrays (data, row, col).
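As a sketch of what that means in practice, appending an empty row with sparse.vstack gives a matrix whose data array is separate from the original's. The example matrix is arbitrary.

```python
import numpy as np
from scipy import sparse

M = sparse.csr_matrix(np.array([[0, 6, 7],
                                [0, 0, 6],
                                [1, 0, 5]]))
empty_row = sparse.csr_matrix((1, 3), dtype=M.dtype)   # 1x3 matrix with no stored values

bigger = sparse.vstack([M, empty_row], format="csr")

print(bigger.shape)
print(bigger.nnz)
print(np.shares_memory(M.data, bigger.data))   # the result owns new arrays
```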

A minimal example of making a larger sparse matrix:

In [110]: M = sparse.random(5,5,.2,'coo')
In [111]: M
Out[111]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in COOrdinate format>
In [112]: M.A
Out[112]: 
array([[0.        , 0.80957797, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.23618044, 0.        , 0.91625967, 0.8791744 ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.7928235 , 0.        ]])
In [113]: M1 = sparse.coo_matrix((M.data, (M.row, M.col)),shape=(7,5))
In [114]: M1
Out[114]: 
<7x5 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in COOrdinate format>
In [115]: M1.A
Out[115]: 
array([[0.        , 0.80957797, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.23618044, 0.        , 0.91625967, 0.8791744 ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.7928235 , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ]])
In [116]: id(M1.data)
Out[116]: 139883362735488
In [117]: id(M.data)
Out[117]: 139883362735488

M and M1 have the same data attribute (same array id). But most operations on these matrices will require a conversion to another format (such as csr for math, or lil for changing values), and will involve copying and modifying the attributes. So this connection between the two matrices will be broken.

When you make a sparse matrix with a function like coo_matrix, and don't provide a shape parameter, it deduces the shape from the provided coordinates. If you provide a shape it uses that. That shape has to be at least as large as the implied shape. With lil (and dok) you can profitably create an 'empty' matrix with a large shape, and then set values iteratively. You don't want to do that with csr. And you can't directly set coo values.
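A minimal sketch of the lil approach: allocate an empty matrix with generous dimensions, set values one at a time, and convert to csr once the filling is done. The shape and values here are arbitrary.

```python
from scipy import sparse

# Empty 4x5 matrix; nothing is stored until values are assigned.
L = sparse.lil_matrix((4, 5), dtype=int)

L[0, 1] = 6     # cheap in lil: updates the per-row lists
L[3, 4] = 2

C = L.tocsr()   # convert once, after the iterative filling
print(C.shape, C.nnz)
print(C.toarray())
```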

The canonical way of creating sparse matrices is to build the data, row, and col arrays or lists iteratively from various pieces - with list append/extend or array concatenates, and make a coo (or csr) format array from that. So you do all the 'growing' before even creating the matrix.
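Applied to a count-vectorizer setting like the one in the question, that could look roughly like this. The toy documents and the vocab dictionary are hypothetical; the point is that the lists grow cheaply and the matrix is built only once at the end.

```python
from scipy import sparse

documents = [["a", "b", "a"], ["b", "c"], ["d"]]   # toy tokenized corpus

rows, cols, vals = [], [], []
vocab = {}                                         # token -> column index
for doc_id, tokens in enumerate(documents):
    for tok in tokens:
        col = vocab.setdefault(tok, len(vocab))    # new tokens get the next column
        rows.append(doc_id)
        cols.append(col)
        vals.append(1)

# Build the matrix in one go; converting coo to csr sums duplicate (row, col) entries.
counts = sparse.coo_matrix(
    (vals, (rows, cols)), shape=(len(documents), len(vocab))
).tocsr()

print(counts.toarray())
```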

changing _shape

Make a matrix:

In [140]: M = (sparse.random(5,3,.4,'csr')*10).astype(int)
In [141]: M
Out[141]: 
<5x3 sparse matrix of type '<class 'numpy.int64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [142]: M.A
Out[142]: 
array([[0, 6, 7],
       [0, 0, 6],
       [1, 0, 5],
       [0, 0, 0],
       [0, 6, 0]])

In [144]: M[1,0] = 10
... SparseEfficiencyWarning)
In [145]: M.A
Out[145]: 
array([[ 0,  6,  7],
       [10,  0,  6],
       [ 1,  0,  5],
       [ 0,  0,  0],
       [ 0,  6,  0]])

Your new shape method (make sure the dtype of indptr doesn't change):

In [146]: M._shape = (6,3)
In [147]: newptr = np.hstack((M.indptr,M.indptr[-1]))
In [148]: newptr
Out[148]: array([0, 2, 4, 6, 6, 7, 7], dtype=int32)
In [149]: M.indptr = newptr
In [150]: M
Out[150]: 
<6x3 sparse matrix of type '<class 'numpy.int64'>'
    with 7 stored elements in Compressed Sparse Row format>
In [151]: M.A
Out[151]: 
array([[ 0,  6,  7],
       [10,  0,  6],
       [ 1,  0,  5],
       [ 0,  0,  0],
       [ 0,  6,  0],
       [ 0,  0,  0]])
In [152]: M[5,2]=10
... SparseEfficiencyWarning)
In [153]: M.A
Out[153]: 
array([[ 0,  6,  7],
       [10,  0,  6],
       [ 1,  0,  5],
       [ 0,  0,  0],
       [ 0,  6,  0],
       [ 0,  0, 10]])

Adding a column also seems to work:

In [154]: M._shape = (6,4)
In [155]: M
Out[155]: 
<6x4 sparse matrix of type '<class 'numpy.int64'>'
    with 8 stored elements in Compressed Sparse Row format>
In [156]: M.A
Out[156]: 
array([[ 0,  6,  7,  0],
       [10,  0,  6,  0],
       [ 1,  0,  5,  0],
       [ 0,  0,  0,  0],
       [ 0,  6,  0,  0],
       [ 0,  0, 10,  0]])
In [157]: M[5,0]=10
.... SparseEfficiencyWarning)
In [158]: M[5,3]=10
.... SparseEfficiencyWarning)
In [159]: M
Out[159]: 
<6x4 sparse matrix of type '<class 'numpy.int64'>'
    with 10 stored elements in Compressed Sparse Row format>
In [160]: M.A
Out[160]: 
array([[ 0,  6,  7,  0],
       [10,  0,  6,  0],
       [ 1,  0,  5,  0],
       [ 0,  0,  0,  0],
       [ 0,  6,  0,  0],
       [10,  0, 10, 10]])

attribute sharing

I can make a new matrix from an existing one:

In [108]: M = (sparse.random(5,3,.4,'csr')*10).astype(int)
In [109]: newptr = np.hstack((M.indptr,6))
In [110]: M1 = sparse.csr_matrix((M.data, M.indices, newptr), shape=(6,3))

The data attributes are shared, at least in the view sense:

In [113]: M[0,1]=14
In [114]: M1[0,1]
Out[114]: 14

But if I modify M1 by adding a nonzero value:

In [117]: M1[5,0]=10
...
  SparseEfficiencyWarning)

The link between the matrices breaks:

In [120]: M[0,1]=3
In [121]: M1[0,1]
Out[121]: 14
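The same behaviour can be checked without reading values back, by asking NumPy whether the two data arrays overlap. This is a sketch with an arbitrary matrix; the SparseEfficiencyWarning is silenced only to keep the output clean.

```python
import warnings

import numpy as np
from scipy import sparse

M = sparse.csr_matrix(np.array([[0, 6, 7],
                                [0, 0, 6],
                                [1, 0, 5]]))

# Repeat the last pointer to describe one extra, empty row (keeps the indptr dtype).
newptr = np.append(M.indptr, M.indptr[-1])
M1 = sparse.csr_matrix((M.data, M.indices, newptr), shape=(4, 3))

shared_before = np.shares_memory(M.data, M1.data)
print(shared_before)

with warnings.catch_warnings():
    warnings.simplefilter("ignore", sparse.SparseEfficiencyWarning)
    M1[3, 0] = 10        # adds a stored element, so M1 gets new data/indices arrays

shared_after = np.shares_memory(M.data, M1.data)
print(shared_after)
```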
