How to read in an edge list to make a scipy sparse matrix
Question
I have a large file where each line has a pair of 8-character strings. Each line looks something like:
ab1234gh iu9240gh
This file really represents a graph, and each string is a node id. I would like to read in the file and directly make a scipy sparse adjacency matrix. I will then run PCA on this matrix using one of the many tools available in Python.
Is there a neat way to do this, or do I need to first make a graph in RAM and then convert that into a sparse matrix? As the file is large, I would like to avoid intermediate steps if possible.
Ultimately I will feed the sparse adjacency matrix into http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD .
Recommended answer
I think this is a regular task in sklearn, so there must be some tool in the package that does this, or an answer in other SO questions. We need to add the correct tag.
But just working from my knowledge of numpy and sparse, here's what I'd do:
Make a sample 2d array (N rows, 2 columns) with character values:
In [638]: A=np.array([('a','b'),('b','d'),('a','d'),('b','c'),('d','e')])
In [639]: A
Out[639]:
array([['a', 'b'],
['b', 'd'],
['a', 'd'],
['b', 'c'],
['d', 'e']],
dtype='<U1')
Use np.unique to identify the unique strings and, as a bonus, a map from those strings to the original array. This is the workhorse of the task.
In [640]: k1,k2,k3=np.unique(A,return_inverse=True,return_index=True)
In [641]: k1
Out[641]:
array(['a', 'b', 'c', 'd', 'e'],
dtype='<U1')
In [642]: k2
Out[642]: array([0, 1, 7, 3, 9], dtype=int32)
In [643]: k3
Out[643]: array([0, 1, 1, 3, 0, 3, 1, 2, 3, 4], dtype=int32)
I can reshape that inverse array to identify the row and col for each entry in A.
In [644]: rows,cols=k3.reshape(A.shape).T
In [645]: rows
Out[645]: array([0, 1, 0, 1, 3], dtype=int32)
In [646]: cols
Out[646]: array([1, 3, 3, 2, 4], dtype=int32)
With those it is trivial to construct a sparse matrix that has a 1 at each of these "intersections".
In [648]: M=sparse.coo_matrix((np.ones(rows.shape,int),(rows,cols)))
In [649]: M
Out[649]:
<4x5 sparse matrix of type '<class 'numpy.int32'>'
with 5 stored elements in COOrdinate format>
In [650]: M.toarray()   # written as M.A in older scipy versions
Out[650]:
array([[0, 1, 0, 1, 0],
[0, 0, 1, 1, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1]])
The first row, a, has values in the 2nd and 4th columns, b and d, and so on.
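The steps above can be collected into one function. This is only a sketch: the name `edge_array_to_adjacency` is hypothetical, not part of numpy or scipy. It also passes an explicit `shape`, on the assumption that a square N x N adjacency matrix is wanted. Without it, `coo_matrix` infers the shape from the largest row and column index, which is why `M` above came out 4x5 rather than 5x5.

```python
import numpy as np
from scipy import sparse


def edge_array_to_adjacency(edges):
    """Build a square sparse adjacency matrix from an (N, 2) array of node ids.

    Hypothetical helper for illustration. Returns the sorted unique node ids
    and a COO matrix with a 1 for every (source, target) pair in `edges`.
    """
    edges = np.asarray(edges)
    # nodes: sorted unique ids; inverse: index into nodes for every entry
    nodes, inverse = np.unique(edges, return_inverse=True)
    # One index pair per edge: column 0 gives the row, column 1 the col
    rows, cols = inverse.reshape(edges.shape).T
    n = len(nodes)
    # data must have one value per edge, i.e. the same shape as rows/cols
    data = np.ones(rows.shape, dtype=int)
    adj = sparse.coo_matrix((data, (rows, cols)), shape=(n, n))
    return nodes, adj


A = np.array([('a', 'b'), ('b', 'd'), ('a', 'd'), ('b', 'c'), ('d', 'e')])
nodes, adj = edge_array_to_adjacency(A)
dense = adj.toarray()
```

The result is directed: an edge a b sets only entry (a, b). For an undirected graph one option is to add the transpose, at the cost of another copy.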
============================
Originally I had:
In [648]: M=sparse.coo_matrix((np.ones(k1.shape,int),(rows,cols)))
This is wrong. The data array should match rows and cols in shape. Here it didn't raise an error because k1 happens to have the same size. But with a different mix of unique values it could raise an error.
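To make the mismatch concrete, here is a small sketch with a made-up edge array (not from the original data) in which the number of unique nodes differs from the number of edges. Sizing data from rows.shape keeps the three arrays consistent regardless of how many distinct ids there are.

```python
import numpy as np
from scipy import sparse

# Four edges over only three distinct nodes (illustrative data)
A = np.array([('a', 'b'), ('b', 'c'), ('a', 'c'), ('c', 'a')])
k1, k3 = np.unique(A, return_inverse=True)
rows, cols = k3.reshape(A.shape).T

# k1 has one entry per unique node, rows/cols have one entry per edge
n_nodes, n_edges = len(k1), len(rows)   # 3 and 4 here

# Correct: one data value per edge, so data matches rows and cols
data = np.ones(rows.shape, dtype=int)
M = sparse.coo_matrix((data, (rows, cols)), shape=(n_nodes, n_nodes))
```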
===================
This approach assumes the whole database, A, can be loaded into memory. unique probably requires similar memory usage. Initially a coo matrix might not increase the memory usage, since it will use the arrays provided as parameters. But any calculations and/or conversion to csr or another format will make further copies.
I can imagine getting around memory issues by loading the database in chunks and using some other structure to get the unique values and mapping. You might even be able to construct a coo matrix from chunks. But sooner or later you'll hit memory issues. The scikit code will be making one or more copies of that sparse matrix.
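One way the "other structure" idea might look is sketched below, under some assumptions: the file has two whitespace-separated ids per line, and a plain dict is an acceptable id-to-index map. The function name `read_edge_list` is hypothetical. It streams the input line by line, so the full array of strings is never held in memory, though the integer index lists and the dict still are.

```python
import io

import numpy as np
from scipy import sparse


def read_edge_list(lines):
    """Stream 'src dst' lines into a square sparse adjacency matrix.

    Hypothetical helper for illustration. `lines` is any iterable of text
    lines, for example an open file object.
    """
    index = {}            # node id -> integer index, in order of first appearance
    rows, cols = [], []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue      # skip blank or malformed lines
        src, dst = parts
        # setdefault returns the existing index, or assigns the next free one
        rows.append(index.setdefault(src, len(index)))
        cols.append(index.setdefault(dst, len(index)))
    n = len(index)
    data = np.ones(len(rows), dtype=int)
    # Converting to CSR makes a further copy and sums any repeated edges
    adj = sparse.coo_matrix((data, (rows, cols)), shape=(n, n)).tocsr()
    return index, adj


# Stand-in for a real file: open('edges.txt') would be passed the same way
sample = io.StringIO("ab1234gh iu9240gh\niu9240gh zz0000aa\nab1234gh zz0000aa\n")
index, adj = read_edge_list(sample)
```

Unlike np.unique, this numbers nodes in order of first appearance rather than sorted order, so the returned dict is needed to interpret the rows and columns.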