Python - Data structure of csr_matrix

Views: 80 · Published: 2020/5/18 21:53:05 · Tags: python, numpy, scipy, scikit-learn


Question

I am studying TF-IDF. I used tfidf_vectorizer.fit_transform, which returns a csr_matrix, but I do not understand the structure of the result.

Input data:

documents = ( "The sky is blue", "The sun is bright", "The sun in the sky is bright", "We can see the shining sun, the bright sun" )

Code:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print(tfidf_matrix)

Result:

(0, 9) 0.34399327143
(0, 7) 0.519713848879
(0, 4) 0.420753151645
(0, 0) 0.659191117868
(1, 9) 0.426858009784
(1, 4) 0.522108621994
(1, 8) 0.522108621994
(1, 1) 0.522108621994
(2, 9) 0.526261040111
(2, 7) 0.397544332095
(2, 4) 0.32184639876
(2, 8) 0.32184639876
(2, 1) 0.32184639876
(2, 3) 0.504234576856
(3, 9) 0.390963088213
(3, 8) 0.47820398015
(3, 1) 0.239101990075
(3, 10) 0.374599471224
(3, 2) 0.374599471224
(3, 5) 0.374599471224
(3, 6) 0.374599471224

tfidf_matrix is a csr_matrix. I looked up the documentation for scipy.sparse.csr_matrix, but I could not find a structure there that looks like this output.

What is the structure of an entry such as (0, 9) 0.34399327143?

Recommended answer

Without the vectorizer, I can recreate the matrix, more or less, with this sequence of operations:

In [703]: documents = ( "The sky is blue", "The sun is bright", "The sun in the sky is bright", "We can see the shining sun the bright sun" )

Get a list of lists of the words (all lower case):

In [704]: alist = [l.lower().split() for l in documents]

Get a sorted list of the unique words:

In [705]: aset = set()
In [706]: [aset.update(l) for l in alist]
Out[706]: [None, None, None, None]
In [707]: unq = sorted(list(aset))
In [708]: unq
Out[708]: 
['blue',
 'bright',
 'can',
 'in',
 'is',
 'see',
 'shining',
 'sky',
 'sun',
 'the',
 'we']

Go through alist and collect word counts. rows will hold the sentence number and cols the index of the word in the unique word list:

In [709]: rows, cols, data = [],[],[]
In [710]: for i,row in enumerate(alist):
     ...:     for c in row:
     ...:         rows.append(i)
     ...:         cols.append(unq.index(c))
     ...:         data.append(1)
     ...:

Make a sparse matrix from this data:

In [711]: M = sparse.csr_matrix((data,(rows,cols)))   # requires: from scipy import sparse
In [712]: M
Out[712]: 
<4x11 sparse matrix of type '<class 'numpy.int32'>'
    with 21 stored elements in Compressed Sparse Row format>
In [713]: print(M)
  (0, 0)    1
  (0, 4)    1
  (0, 7)    1
  (0, 9)    1
  (1, 1)    1
  ....
  (3, 9)    2
  (3, 10)   1
In [714]: M.A        # viewed as 2d array; same as M.toarray()
Out[714]: 
array([[1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0],
       [0, 1, 0, 1, 1, 0, 0, 1, 1, 2, 0],
       [0, 1, 1, 0, 0, 1, 1, 0, 2, 2, 1]], dtype=int32)
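
One detail worth spelling out: the (data, (rows, cols)) form of the constructor adds up entries that share the same (row, col) position. That is why M shows 2 for words that repeat within a sentence even though every appended value was 1. A minimal sketch with a made-up 2x3 example (the names and values here are illustrative, not taken from the session above):

```python
from scipy import sparse

# Hypothetical toy input: position (0, 2) is listed twice.
rows = [0, 0, 0, 1]
cols = [0, 2, 2, 1]
data = [1, 1, 1, 1]

# Duplicate (row, col) pairs are summed when the matrix is built.
M_small = sparse.csr_matrix((data, (rows, cols)), shape=(2, 3))

dense = M_small.toarray().tolist()
print(dense)         # [[1, 0, 2], [0, 1, 0]]
print(M_small.nnz)   # 3 stored elements after summing
```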

Since this is using sklearn, I can reproduce your matrix with:

In [717]: from sklearn import feature_extraction
In [718]: tf = feature_extraction.text.TfidfVectorizer()
In [719]: tfM = tf.fit_transform(documents)
In [720]: tfM
Out[720]: 
<4x11 sparse matrix of type '<class 'numpy.float64'>'
    with 21 stored elements in Compressed Sparse Row format>
In [721]: print(tfM)
  (0, 9)    0.34399327143
  (0, 7)    0.519713848879
  (0, 4)    0.420753151645
  ....
  (3, 5)    0.374599471224
  (3, 6)    0.374599471224
In [722]: tfM.A
Out[722]: 
array([[ 0.65919112,  0.        ,  0.        ,  0.        ,  0.42075315,
         0.        ,  0.        ,  0.51971385,  0.        ,  0.34399327,
         0.        ],....
       [ 0.        ,  0.23910199,  0.37459947,  0.        ,  0.        ,
         0.37459947,  0.37459947,  0.        ,  0.47820398,  0.39096309,
         0.37459947]])
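
To see which word each column number stands for, the fitted vectorizer keeps a vocabulary_ dictionary that maps each term to its column index. A small sketch, assuming scikit-learn is installed and the default TfidfVectorizer settings are used (col_to_word is a name made up for this illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun, the bright sun",
)

tf = TfidfVectorizer()
tfM = tf.fit_transform(documents)

# vocabulary_ maps term -> column index; invert it to look up by column.
col_to_word = {int(col): word for word, col in tf.vocabulary_.items()}

print(tfM.shape)  # (4, 11)
# Columns of the first sentence, in the order shown by print(tfM):
first_row_words = [col_to_word[c] for c in (9, 7, 4, 0)]
print(first_row_words)
```

So the entry (0, 9) refers to row 0 (the first sentence) and column 9 (the word "the").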

The actual data is stored as 3 attribute arrays:

In [723]: tfM.indices
Out[723]: 
array([ 9,  7,  4,  0,  9,  4,  8,  1,  9,  7,  4,  8,  1,  3,  9,  8,  1,
       10,  2,  5,  6], dtype=int32)
In [724]: tfM.data
Out[724]: 
array([ 0.34399327,  0.51971385,  0.42075315,  0.65919112,  0.42685801,
       ...
        0.37459947])
In [725]: tfM.indptr
Out[725]: array([ 0,  4,  8, 14, 21], dtype=int32)
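
How the three arrays fit together: row i owns the slice indptr[i]:indptr[i+1] of both indices (the column numbers) and data (the values). The sketch below is plain Python, with the arrays copied from the output above (the data values are taken from the printed matrix in the question). It rebuilds the (row, col) value triples that print(tfM) displays. The function name csr_triples is made up for this illustration.

```python
indptr = [0, 4, 8, 14, 21]
indices = [9, 7, 4, 0, 9, 4, 8, 1, 9, 7, 4, 8, 1, 3, 9, 8, 1, 10, 2, 5, 6]
data = [0.34399327143, 0.519713848879, 0.420753151645, 0.659191117868,
        0.426858009784, 0.522108621994, 0.522108621994, 0.522108621994,
        0.526261040111, 0.397544332095, 0.32184639876, 0.32184639876,
        0.32184639876, 0.504234576856,
        0.390963088213, 0.47820398015, 0.239101990075, 0.374599471224,
        0.374599471224, 0.374599471224, 0.374599471224]

def csr_triples(indptr, indices, data):
    """Expand CSR arrays into a list of (row, col, value) triples."""
    triples = []
    for row in range(len(indptr) - 1):
        # Entries of this row live between two consecutive pointers.
        start, stop = indptr[row], indptr[row + 1]
        for k in range(start, stop):
            triples.append((row, indices[k], data[k]))
    return triples

triples = csr_triples(indptr, indices, data)
for row, col, value in triples[:4]:
    print((row, col), value)
```

Each line of the printed matrix is therefore one stored element: its row number, its column number, and its value.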

The indices values for individual rows tell us which words occur in that sentence:

In [726]: np.array(unq)[M[0,].indices]   # requires: import numpy as np
Out[726]: 
array(['blue', 'is', 'sky', 'the'],
      dtype='<U7')
In [727]: np.array(unq)[M[3,].indices]
Out[727]: 
array(['bright', 'can', 'see', 'shining', 'sun', 'the', 'we'],
      dtype='<U7')
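
The same lookup works on the TF-IDF matrix, where pairing each column index with its value shows how strongly each word is weighted in a sentence. A plain-Python sketch for row 3, using the indices and values printed in the question and the unq list from above (row3_indices, row3_data, weights and ranked are illustrative names):

```python
unq = ['blue', 'bright', 'can', 'in', 'is', 'see',
       'shining', 'sky', 'sun', 'the', 'we']

# Row 3 of tfM: the slices indices[14:21] and data[14:21].
row3_indices = [9, 8, 1, 10, 2, 5, 6]
row3_data = [0.390963088213, 0.47820398015, 0.239101990075,
             0.374599471224, 0.374599471224, 0.374599471224,
             0.374599471224]

# Pair each word with its weight, then rank from highest to lowest.
weights = {unq[col]: value for col, value in zip(row3_indices, row3_data)}
ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
for word, value in ranked:
    print(f"{word:8s} {value:.4f}")
```

Here 'sun' ranks highest: it appears twice in the sentence. 'the' also appears twice, but it occurs in every document, so its weight is lower.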
