How to handle huge sparse matrix construction using SciPy?

Question

I am working on a Wikipedia dump to compute the PageRank of roughly 5,700,000 pages. The files are preprocessed and hence are not in XML.
They are taken from http://haselgrove.id.au/wikipedia.htm and the format is:

from_page(1): to(12) to(13) to(14)..
from_page(2): to(21) to(22)..
.
.
.
from_page(5,700,000): to(xy) to(xz)

And so on. Basically, this means constructing a [5,700,000 x 5,700,000] matrix, which would break my 4 GB of RAM. Since it is very sparse, it should be easier to store using scipy.sparse.lil_matrix or scipy.sparse.dok_matrix. Now my issue is:

How do I go about converting the .txt file with the link information into a sparse matrix? Should I read it, build a normal N*N matrix and then convert it, or do something else? I have no idea.

Also, the links of one page sometimes span several lines, so what would be the correct way to handle that?
E.g., a random excerpt looks like this:

[
1: 2 3 5 64636 867
2:355 776 2342 676 232
3: 545 64646 234242 55455 141414 454545 43
4234 5545345 2423424545
4:454 6776
]

Exactly like this: no commas and no other delimiters.

Any information on sparse matrix construction and on handling data that spans lines would be helpful.

Recommended answer

SciPy offers several implementations of sparse matrices. Each of them has its own advantages and disadvantages. The matrix formats are described in the scipy.sparse documentation.

There are several ways to get to your desired sparse matrix. Computing the full NxN matrix and then converting it is probably not possible due to the high memory requirements (5,700,000^2 is about 3 x 10^13 entries!).

In your case I would prepare your data to construct a coo_matrix.

coo_matrix((data, (i, j)), [shape=(M, N)])

data[:] the entries of the matrix, in any order
i[:] the row indices of the matrix entries
j[:] the column indices of the matrix entries
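
As a minimal sketch of the constructor described above, the following builds a tiny link matrix from three parallel arrays. The toy link list, the 0-based page indices, and the variable names are assumptions made for illustration; they are not taken from the Wikipedia files.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy link list (assumed data): entry k says page rows[k] links to page cols[k].
rows = np.array([0, 0, 1, 2], dtype=np.int32)
cols = np.array([1, 2, 2, 0], dtype=np.int32)
data = np.ones(len(rows), dtype=np.float32)  # 1.0 marks "a link exists"

n_pages = 3
adjacency = coo_matrix((data, (rows, cols)), shape=(n_pages, n_pages))

print(adjacency.nnz)        # number of stored entries
print(adjacency.toarray())  # dense view, only sensible for toy sizes
```

Only the three arrays are stored, so memory grows with the number of links rather than with N*N.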

You might also want to have a look at lil_matrix, which can be used to build your matrix incrementally.
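
A small sketch of the incremental approach with lil_matrix might look like the following. The matrix size and the link pairs are made up for illustration.

```python
import numpy as np
from scipy.sparse import lil_matrix

n_pages = 5  # assumed toy size
links = lil_matrix((n_pages, n_pages), dtype=np.float32)

# Entries can be assigned one at a time, as they are read from the input.
for from_page, to_page in [(0, 1), (0, 3), (2, 4), (4, 0)]:
    links[from_page, to_page] = 1.0

print(links.nnz)     # number of stored entries
print(links.rows)    # per-row lists of column indices
```

Element-wise assignment is convenient, but for many millions of links it is likely to be slower than collecting index arrays first and building a coo_matrix in one call.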

Once you have created the matrix, you can convert it to a format better suited for calculation (for example CSR or CSC), depending on your use case.
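
To illustrate the conversion step, the sketch below turns a toy COO matrix into CSR and runs one rank-propagation step with a matrix-vector product. It assumes the use case is a PageRank-style iteration, omits damping, and assumes every page has at least one outgoing link; the data is invented.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy adjacency (assumed data): entry (i, j) = 1 means page i links to page j.
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 2, 2, 0])
coo = coo_matrix((np.ones(4), (rows, cols)), shape=(3, 3))

csr = coo.tocsr()  # CSR supports fast row operations and matrix-vector products

# Out-degree of each page = row sums of the adjacency matrix.
out_degree = np.asarray(csr.sum(axis=1)).ravel()

# One un-damped propagation step: each page splits its rank evenly among
# the pages it links to (assumes no page has out-degree zero).
rank = np.full(3, 1.0 / 3.0)
new_rank = csr.T.dot(rank / out_degree)

print(new_rank)
```

A real run would also need to handle pages without outgoing links and apply a damping factor, both of which are left out here.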

I do not recognize the data format; there might be parsers for it, there might not. Writing your own parser should not be very difficult, though. Each line containing a colon starts a new row. All indices after the colon, and all indices on the following lines that contain no colon, are the column entries for that row.
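
The parsing rule above could be sketched roughly as follows. The function name parse_links is hypothetical, and the sketch assumes that every token after the colon is a plain non-negative integer; anything else is skipped. The sample text is the excerpt from the question.

```python
import io
import numpy as np

def parse_links(lines):
    """Return (rows, cols) index arrays from 'from: to to ...' lines."""
    rows, cols = [], []
    current = None  # source page of the row currently being read
    for line in lines:
        if ":" in line:
            # A colon starts a new row; the rest of the line holds its links.
            head, line = line.split(":", 1)
            current = int(head)
        # Lines without a colon continue the previous row.
        for token in line.split():
            if token.isdigit() and current is not None:
                rows.append(current)
                cols.append(int(token))
    # int64 because some ids in the sample exceed the int32 range.
    return np.array(rows, dtype=np.int64), np.array(cols, dtype=np.int64)

sample = io.StringIO(
    "1: 2 3 5 64636 867\n"
    "2:355 776 2342 676 232\n"
    "3: 545 64646 234242 55455 141414 454545 43\n"
    "4234 5545345 2423424545\n"
    "4:454 6776\n"
)
rows, cols = parse_links(sample)
print(len(rows), rows[:5], cols[:5])
```

The two arrays can then be passed to coo_matrix together with an array of ones. For the full dump, plain Python lists may use a lot of memory, so reading the file in chunks or using a more compact container could be worth considering.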
