How to read in an edge list to make a scipy sparse matrix


Question

I have a large file where each line has a pair of 8-character strings. Something like:

ab1234gh iu9240gh

on each line.

This file really represents a graph, and each string is a node id. I would like to read in the file and directly make a scipy sparse adjacency matrix. I will then run PCA on this matrix using one of the many tools available in Python.

Is there a neat way to do this, or do I need to first make a graph in RAM and then convert that into a sparse matrix? As the file is large, I would like to avoid intermediate steps if possible.

Ultimately I will feed the sparse adjacency matrix into http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD .

Recommended answer

I think this is a routine task in sklearn, so there is probably some tool in the package that does this, or an answer in another SO question. We need to add the correct tag.

But just working from my knowledge of numpy and scipy.sparse (imported here as np and sparse), here's what I'd do:

Make a sample 2d array - N rows, 2 columns with character values:

In [638]: A=np.array([('a','b'),('b','d'),('a','d'),('b','c'),('d','e')])
In [639]: A
Out[639]: 
array([['a', 'b'],
       ['b', 'd'],
       ['a', 'd'],
       ['b', 'c'],
       ['d', 'e']], 
      dtype='<U1')

Use np.unique to identify the unique strings and, as a bonus, get a map from those strings back to the original array. This is the workhorse of the task.

In [640]: k1,k2,k3=np.unique(A,return_inverse=True,return_index=True)
In [641]: k1
Out[641]: 
array(['a', 'b', 'c', 'd', 'e'], 
      dtype='<U1')
In [642]: k2
Out[642]: array([0, 1, 7, 3, 9], dtype=int32)
In [643]: k3
Out[643]: array([0, 1, 1, 3, 0, 3, 1, 2, 3, 4], dtype=int32)
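To make the role of the inverse array concrete, here is a small sketch (the variable name `rebuilt` is only illustrative): indexing the unique values with the inverse indices should give back every entry of A, which is what lets those indices double as row and column numbers.

```python
import numpy as np

# Same sample edge list as in the session above.
A = np.array([('a', 'b'), ('b', 'd'), ('a', 'd'), ('b', 'c'), ('d', 'e')])
k1, k3 = np.unique(A, return_inverse=True)

# Indexing the unique values with the inverse rebuilds the original entries.
# The reshape is harmless if the inverse already has A's shape.
rebuilt = k1[k3].reshape(A.shape)
print(np.array_equal(rebuilt, A))
```

Each node id is thereby replaced by its position in the sorted array k1.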

I can reshape that inverse array to identify the row and col for each entry in A.

In [644]: rows,cols=k3.reshape(A.shape).T
In [645]: rows
Out[645]: array([0, 1, 0, 1, 3], dtype=int32)
In [646]: cols
Out[646]: array([1, 3, 3, 2, 4], dtype=int32)

It is then trivial to construct a sparse matrix that has a 1 at each of these 'intersections'.

In [648]: M=sparse.coo_matrix((np.ones(rows.shape,int),(rows,cols)))
In [649]: M
Out[649]: 
<4x5 sparse matrix of type '<class 'numpy.int32'>'
    with 5 stored elements in COOrdinate format>
In [650]: M.A
Out[650]: 
array([[0, 1, 0, 1, 0],
       [0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1]])

The first row, a, has values in the 2nd and 4th columns, b and d, and so on.
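Putting the steps together, a minimal end-to-end sketch might look like the following. The function name `edges_to_adjacency` and the in-memory sample file are assumptions for illustration, not part of the original session. It also passes an explicit shape so that the result is square (one row and one column per node), which the session above does not do: there the shape is inferred from the largest indices, giving 4x5.

```python
import io
import numpy as np
from scipy import sparse

def edges_to_adjacency(lines):
    """Build (node_ids, square COO adjacency matrix) from 'src dst' lines."""
    # Parse every non-blank line into a pair of strings (N rows, 2 columns).
    A = np.array([line.split() for line in lines if line.strip()])
    # nodes: sorted unique ids; inverse: position in nodes of each entry of A.
    nodes, inverse = np.unique(A, return_inverse=True)
    rows, cols = inverse.reshape(A.shape).T
    n = len(nodes)
    # One data value per edge, and an explicit n x n shape.
    data = np.ones(rows.shape, dtype=int)
    M = sparse.coo_matrix((data, (rows, cols)), shape=(n, n))
    return nodes, M

# Hypothetical stand-in for the real file; an open file handle works the same way.
sample = io.StringIO("a b\nb d\na d\nb c\nd e\n")
nodes, M = edges_to_adjacency(sample)
print(nodes)
print(M.toarray())
```

This gives a directed adjacency matrix. If the graph is undirected, one option is to symmetrize it afterwards, for example with M + M.T.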

============================

Originally I had:

In [648]: M=sparse.coo_matrix((np.ones(k1.shape,int),(rows,cols)))

This is wrong. The data array should match rows and cols in shape. Here it didn't raise an error because k1 happens to have the same size. But with a different mix of unique values it could raise an error.
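A small sketch of the failure mode, using a hypothetical edge list with 3 edges over 4 distinct nodes so that the two sizes differ. As far as I know, scipy reports the length mismatch as a ValueError, so that is the exception this sketch catches.

```python
import numpy as np
from scipy import sparse

# Hypothetical edge list: 3 edges, 4 unique node ids.
A = np.array([('a', 'b'), ('c', 'd'), ('a', 'd')])
k1, k3 = np.unique(A, return_inverse=True)
rows, cols = k3.reshape(A.shape).T

try:
    # Wrong: one data value per unique node (4) instead of one per edge (3).
    sparse.coo_matrix((np.ones(k1.shape, int), (rows, cols)))
    raised = False
except ValueError:
    raised = True

# Right: the data array has the same length as rows and cols.
M = sparse.coo_matrix((np.ones(rows.shape, int), (rows, cols)))
print(raised, M.nnz)
```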

===================

This approach assumes the whole database, A, can be loaded into memory. unique probably requires a similar amount of memory. Initially a coo matrix might not increase memory usage, since it will use the arrays provided as parameters. But any calculations and/or conversion to csr or another format will make further copies.

I can imagine getting around memory issues by loading the database in chunks and using some other structure to get the unique values and mapping. You might even be able to construct a coo matrix from chunks. But sooner or later you'll hit memory issues. The scikit code will be making one or more copies of that sparse matrix.
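One way the "other structure" idea could be sketched: stream the file line by line, keep a plain dict from node id to integer index, and collect the indices in compact typed buffers, so the N x 2 string array is never built. Everything here (the function name `stream_edges_to_csr`, the use of array('q'), the int32 data type) is an assumption for illustration, not something from the scipy or sklearn APIs.

```python
import io
from array import array
import numpy as np
from scipy import sparse

def stream_edges_to_csr(lines):
    """Read 'src dst' pairs one line at a time; return (node_ids, CSR adjacency)."""
    index = {}         # node id -> integer index, in order of first appearance
    rows = array('q')  # 8-byte signed ints, far smaller than a list of Python ints
    cols = array('q')
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue   # skip blank or malformed lines
        src, dst = parts
        rows.append(index.setdefault(src, len(index)))
        cols.append(index.setdefault(dst, len(index)))
    n = len(index)
    # View the buffers as int64 arrays without copying them.
    r = np.frombuffer(rows, dtype=np.int64)
    c = np.frombuffer(cols, dtype=np.int64)
    data = np.ones(len(r), dtype=np.int32)
    # Converting to CSR sums repeated edges and is the form most solvers expect.
    M = sparse.coo_matrix((data, (r, c)), shape=(n, n)).tocsr()
    return list(index), M

# Hypothetical stand-in for the real file; an open file handle works the same way.
sample = io.StringIO("a b\nb d\na d\nb c\nd e\n")
ids, M = stream_edges_to_csr(sample)
print(ids)
print(M.toarray())
```

Unlike np.unique, this numbers the nodes in order of first appearance rather than in sorted order, so keep the returned id list to interpret rows and columns. The dict of ids still has to fit in memory, and the conversion to CSR still makes a copy, as noted above.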
