对于网页排名的有效实施过渡矩阵 [英] Efficient implementation of the transition matrix for page rank

查看:228
本文介绍了对于网页排名的有效实施过渡矩阵的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想实现的PageRank。我在这里读的描述:的http:// NLP .stanford.edu / IR-书/ HTML / htmledition /马尔可夫链,1.HTML

I'm trying to implement PageRank. I'm reading the description here: http://nlp.stanford.edu/IR-book/html/htmledition/markov-chains-1.html

一切都非常清楚,我,但我很担心矩阵$ P $的建设。我发现,构建$ P $用简单的方式将非常昂贵。例如:实施步骤1中,人会需要检查$ A $每一行,然后检查该行的每一个元素,看是否所有的元素都为零。用于步骤2中的一个将需要计算中的一些为每一行的数目。我能想象我的code有讨厌的缓慢循环。我在想,如果有聪明的线性代数技术,可以有效地构造$ P $。我将使用python numpy的为我的编码。

Everything is very clear to me, however I'm concerned about the construction of the matrix $P$. I find that constructing $P$ the naive way would be very expensive. For example: to implement step 1, one would need to check every row of $A$ and then check every element of that row to see if all elements are zero. For step 2 one would need to compute the number of ones for each row. I can imagine my code to have nasty slow loops. I was wondering if there are smart linear algebra techniques that could efficiently construct $P$. I will be using python numpy for my coding.

编辑:一个办法,我想现在要解决,这是做一个求和部件明智超过$列A $。由我会的列向量。现在,我将通过这个向量的每一个元素,检查哪些元素是零。因此,我现在可以知道哪些行没有1秒,我可以乘这些行以$ 1 / N $。

one way I'm thinking now to solve this is by doing a summation element wise over the columns of $A$. By that I would have a column vector. Now I will go through each element of this vector to check which elements are zeros. Thus I can now know which rows has no 1s and I can multiply those rows with $1/N$.

推荐答案

您的关注是正确的。由于网页(顶点重新presenting图)的数量是巨大的,也不可能真正产生这样的 A 和它的工作。

Your concern is correct. Since the number of web pages (vertices in the representing graph) is huge, it is impossible to actually generate such A and work on it.

矩阵计算的网页排名可以更有效地利用稀疏矩阵实现计算的,因为基质是非常稀疏的。大多数网页实际上不彼此连接,因此在基体最条目都为0

The matrix calculation of page rank can be much more efficiently calculated using sparse matrix implementations, since the matrix is very sparse. Most webpages are not actually connected to each other, so most entries in the matrix are 0.

稀疏矩阵构建如下:

  • 在构建矩阵A所描述的 A_ij = 1 如果(I,J)是一个优势,否则<$ C C $> A_ij = 0
  • 在第1步通常是不发,而是我们除去'汇'反复。这样做是为了prevent基质是致密,一些替代品也联'下沉'回到那个链接到这些,或水槽链接到本身的节点。
  • 在(2)
  • 描述将每个1的
  • Build matrix A as described A_ij = 1 if (i,j) is an edge, otherwise A_ij = 0
  • Step 1 is usually not made, and instead we remove 'sinks' iteratively. This is done to prevent the matrix being dense, some alternatives are also linking 'sinks' back to the nodes that linked to them, or link a sink to itself.
  • Divide each 1 in A as described in (2)

让我们来表示结果矩阵为 M ,这是结果矩阵对本会工作,为了得到一个列向量 P> / code>(与1 / n的初始化,每个条目)。

Let's denote the resulting matrix as M, and this is the resulting matrix we will work on, in order to get a column vector p (which is initialized with 1/n for each entry).

x = [1/n, 1/n, ... , 1/n]^T //a column vector
p = [1/n, 1/n, ... , 1/n]^T //a column vector with the initial ranks
M = genSparseMatrix() //as described above
do until p converge:
    p = (1-\alpha)* M*p + (\alpha) * x
return p

在最后,这会产生 P ,保存每个节点的网页排名值的列向量。

In the end, this yields p, the column vector that holds the page rank value for each node.

这篇关于对于网页排名的有效实施过渡矩阵的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆