Python - Data structure of csr_matrix
Question
I am studying TF-IDF. I used tfidf_vectorizer.fit_transform, which returns a csr_matrix, but I do not understand the structure of the result.
- Input data:
documents = ( "The sky is blue", "The sun is bright", "The sun in the sky is bright", "We can see the shining sun, the bright sun" )
- Declaration:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print(tfidf_matrix)
- Result:
(0, 9) 0.34399327143
(0, 7) 0.519713848879
(0, 4) 0.420753151645
(0, 0) 0.659191117868
(1, 9) 0.426858009784
(1, 4) 0.522108621994
(1, 8) 0.522108621994
(1, 1) 0.522108621994
(2, 9) 0.526261040111
(2, 7) 0.397544332095
(2, 4) 0.32184639876
(2, 8) 0.32184639876
(2, 1) 0.32184639876
(2, 3) 0.504234576856
(3, 9) 0.390963088213
(3, 8) 0.47820398015
(3, 1) 0.239101990075
(3, 10) 0.374599471224
(3, 2) 0.374599471224
(3, 5) 0.374599471224
(3, 6) 0.374599471224
tfidf_matrix is a csr_matrix. I looked at the scipy.sparse.csr_matrix documentation, but I found no structure there that looks like this result.
What is the structure of a value such as (0, 9) 0.34399327143?
Recommended answer
Without the vectorizer, I can recreate the matrix, more or less, with this sequence of operations:
In [703]: documents = ( "The sky is blue", "The sun is bright", "The sun in the sky is bright", "We can see the shining sun the bright sun" )
Get a list of lists of the words (all lower case):
In [704]: alist = [l.lower().split() for l in documents]
Get a sorted list of the unique words:
In [705]: aset = set()
In [706]: [aset.update(l) for l in alist]
Out[706]: [None, None, None, None]
In [707]: unq = sorted(list(aset))
In [708]: unq
Out[708]:
['blue',
'bright',
'can',
'in',
'is',
'see',
'shining',
'sky',
'sun',
'the',
'we']
Go through alist and collect word counts. rows will be the sentence number, and cols will be the index of the word in unq:
In [709]: rows, cols, data = [],[],[]
In [710]: for i,row in enumerate(alist):
     ...:     for c in row:
     ...:         rows.append(i)
     ...:         cols.append(unq.index(c))
     ...:         data.append(1)
     ...:
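The steps so far can also be run as a plain script. This is only a sketch of the same idea, not what TfidfVectorizer does internally. word_to_col is a name introduced here; the dict lookup stands in for unq.index(c), which rescans the list for every word.

```python
documents = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun the bright sun",
)

# Lower-case each sentence and split it on whitespace.
alist = [doc.lower().split() for doc in documents]

# Sorted unique words; a word's position in this list is its column index.
unq = sorted({word for words in alist for word in words})
word_to_col = {word: col for col, word in enumerate(unq)}  # hypothetical helper

# One (row, col, 1) triplet per word occurrence, repeats included.
rows, cols, data = [], [], []
for i, words in enumerate(alist):
    for word in words:
        rows.append(i)
        cols.append(word_to_col[word])
        data.append(1)

print(len(unq), len(data))
print(cols[:4])  # columns of "the", "sky", "is", "blue"
```

Note that this produces one triplet per word occurrence, so a word repeated within a sentence contributes the same (row, col) pair more than once.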
Make a sparse matrix from this data (sparse here is scipy.sparse, imported with from scipy import sparse):
In [711]: M = sparse.csr_matrix((data,(rows,cols)))
In [712]: M
Out[712]:
<4x11 sparse matrix of type '<class 'numpy.int32'>'
with 21 stored elements in Compressed Sparse Row format>
In [713]: print(M)
(0, 0) 1
(0, 4) 1
(0, 7) 1
(0, 9) 1
(1, 1) 1
....
(3, 9) 2
(3, 10) 1
In [714]: M.A # viewed as 2d array; shorthand for M.toarray()
Out[714]:
array([[1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
[0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0],
[0, 1, 0, 1, 1, 0, 0, 1, 1, 2, 0],
[0, 1, 1, 0, 0, 1, 1, 0, 2, 2, 1]], dtype=int32)
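The 2s in this array come from repeated (row, col) pairs: as far as I can tell, the csr_matrix((data, (rows, cols))) form adds duplicate entries together instead of storing them separately. A small sketch with made-up triplets (not the data above) to show the effect:

```python
from scipy import sparse

# Toy triplets, invented for illustration: the pair (0, 2) appears twice.
rows = [0, 0, 0, 1]
cols = [2, 0, 2, 1]
data = [1, 1, 1, 1]

M = sparse.csr_matrix((data, (rows, cols)), shape=(2, 3))

dense = M.toarray().tolist()
print(dense)   # the two (0, 2) entries end up as a single 2
print(M.nnz)   # number of stored elements after merging
```

This would also explain why the four sentences contain 24 words in total while the matrix reports only 21 stored elements.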
Since this is using sklearn, I can reproduce your matrix with:
In [717]: from sklearn import feature_extraction
In [718]: tf = feature_extraction.text.TfidfVectorizer()
In [719]: tfM = tf.fit_transform(documents)
In [720]: tfM
Out[720]:
<4x11 sparse matrix of type '<class 'numpy.float64'>'
with 21 stored elements in Compressed Sparse Row format>
In [721]: print(tfM)
(0, 9) 0.34399327143
(0, 7) 0.519713848879
(0, 4) 0.420753151645
....
(3, 5) 0.374599471224
(3, 6) 0.374599471224
In [722]: tfM.A
Out[722]:
array([[ 0.65919112, 0. , 0. , 0. , 0.42075315,
0. , 0. , 0.51971385, 0. , 0.34399327,
0. ],....
[ 0. , 0.23910199, 0.37459947, 0. , 0. ,
0.37459947, 0.37459947, 0. , 0.47820398, 0.39096309,
0.37459947]])
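The actual data is stored in three attribute arrays: indices (the column index of each stored value), data (the values themselves), and indptr (offsets marking where each row starts and stops in the other two):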
In [723]: tfM.indices
Out[723]:
array([ 9, 7, 4, 0, 9, 4, 8, 1, 9, 7, 4, 8, 1, 3, 9, 8, 1,
10, 2, 5, 6], dtype=int32)
In [724]: tfM.data
Out[724]:
array([ 0.34399327, 0.51971385, 0.42075315, 0.65919112, 0.42685801,
...
0.37459947])
In [725]: tfM.indptr
Out[725]: array([ 0, 4, 8, 14, 21], dtype=int32)
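To connect these arrays with the printed (row, col) value lines, here is a minimal pure-Python sketch of the lookup: row i owns positions indptr[i] up to (but not including) indptr[i+1] in both indices and data. The indptr and indices lists are copied from the output above; the data values are copied from the listing in the question, since Out[724] is abbreviated.

```python
indptr = [0, 4, 8, 14, 21]
indices = [9, 7, 4, 0, 9, 4, 8, 1, 9, 7, 4, 8, 1, 3, 9, 8, 1, 10, 2, 5, 6]
data = [
    0.34399327143, 0.519713848879, 0.420753151645, 0.659191117868,
    0.426858009784, 0.522108621994, 0.522108621994, 0.522108621994,
    0.526261040111, 0.397544332095, 0.32184639876, 0.32184639876,
    0.32184639876, 0.504234576856,
    0.390963088213, 0.47820398015, 0.239101990075, 0.374599471224,
    0.374599471224, 0.374599471224, 0.374599471224,
]

# Walk the rows; each row's slice of indices/data is bounded by indptr.
triplets = []
for row in range(len(indptr) - 1):
    start, stop = indptr[row], indptr[row + 1]
    for k in range(start, stop):
        triplets.append((row, indices[k], data[k]))

# Print the first row in the same layout as print(tfM).
for row, col, value in triplets[:4]:
    print((row, col), value)
```

So a line such as (0, 9) 0.34399327143 is not stored as a pair plus a value: the 0 is implied by indptr, the 9 comes from indices, and the number comes from data.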
The indices values for individual rows tell us which words occur in that sentence (np is numpy, imported as np):
In [726]: np.array(unq)[M[0,].indices]
Out[726]:
array(['blue', 'is', 'sky', 'the'],
dtype='<U7')
In [727]: np.array(unq)[M[3,].indices]
Out[727]:
array(['bright', 'can', 'see', 'shining', 'sun', 'the', 'we'],
dtype='<U7')
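The same lookup can be done on tfM itself, using the vectorizer's own vocabulary instead of unq. A sketch, assuming scikit-learn is installed: vocabulary_ maps each word to its column index, while col_to_word and words_in_row are names introduced here. The order of the indices within a row may differ between versions, so the words are sorted before printing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun the bright sun",
)

tf = TfidfVectorizer()
tfM = tf.fit_transform(documents)

# Invert the fitted vocabulary: column index -> word.
col_to_word = {int(col): word for word, col in tf.vocabulary_.items()}

def words_in_row(matrix, i):
    """Return the sorted words stored in row i (hypothetical helper)."""
    start, stop = matrix.indptr[i], matrix.indptr[i + 1]
    return sorted(col_to_word[int(j)] for j in matrix.indices[start:stop])

row0 = words_in_row(tfM, 0)
row3 = words_in_row(tfM, 3)
print(row0)
print(row3)
```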