PageRank和它的数学:说明需要 [英] Pagerank and its mathematics: Explanation needed

查看:119
本文介绍了PageRank和它的数学:说明需要的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有兴趣开发一个搜索引擎从我的国家,索引网页的学生。我一直在研究算法,现在使用了一段时间,我已经确定HITS和PageRank是最好的在那里。我决定去的PageRank,因为它比HITS算法更稳定(或让我读)。

I am a student interested in developing a search engine that indexes pages from my country. I have been researching algorithms to use for sometime now and I have identified HITS and PageRank as the best out there. I have decided to go with PageRank since it is more stable than the HITS algorithm (or so I have read).

我已经找到了无数的文章和相关的PageRank的学术论文,但我的问题是,我不明白,大多数形成算法在这些论文的数学符号。具体而言,我不明白是怎么谷歌矩阵(束缚,随机矩阵)进行计算。

I have found countless articles and academic papers related to PageRank, but my problem is that I don't understand most of the mathematical symbols that form the algorithm in these papers. Specifically, I don't understand how the Google Matrix (the irreducible,stochastic matrix) is calculated.

我的理解是基于这两篇文章:

My understanding is based on these two articles:

  • <一个href="http://online.redwoods.cc.ca.us/instruct/darnold/LAPROJ/fall2005/levicob/LinAlgPaperFinal2-Screen.pdf">http://online.redwoods.cc.ca.us/instruct/darnold/LAPROJ/fall2005/levicob/LinAlgPaperFinal2-Screen.pdf
  • <一个href="http://ilpubs.stanford.edu:8090/386/1/1999-31.pdf">http://ilpubs.stanford.edu:8090/386/1/1999-31.pdf
  • http://online.redwoods.cc.ca.us/instruct/darnold/LAPROJ/fall2005/levicob/LinAlgPaperFinal2-Screen.pdf
  • http://ilpubs.stanford.edu:8090/386/1/1999-31.pdf

可能有人提供了一个基本的解释(例子将是很好)用更少的数学符号?

Could someone provide a basic explanation (examples would be nice) with less mathematical symbols?

在此先感谢。

推荐答案

的PageRank的正式确定指标,因为在引用文件第4页定义,是pssed与滑稽的E的数学公式前$ P $符号(它实际上是资本西格玛希腊字母,适马是字母S,它在这里代表的求和的)。

The formal defintion of PageRank, as defined at page 4 of the cited document, is expressed in the mathematical equation with the funny "E" symbol (it is in fact the capital Sigma Greek letter. Sigma is the letter "S" which here stands for Summation).

在简单地说这个公式说,以计算网页的PageRank的X ...

In a nutshell this formula says that to calculate the PageRank of page X...


   For all the backlinks to this page  (=all the pages that link to X)
   you need to calculate a value that is
         The PageRank of the page that links to X    [R'(v)]
         divided by 
         the number of links found on this page.    [Nv]
         to which you add
           some "source of rank",  [E(u)] normalized by c
             (we'll get to the purpose of that later.)

     And you need to make the sum of all these values [The Sigma thing]
     and finally, multiply it by a constant   [c] 
        (this constant is just to keep the range of PageRank manageable)

的核心思想是这个公式是链接到指定页X的所有网页都增加了价值,它的身价。通过链接到一些网页,他们都赞成这个页面的投票。然而,这种投票有或多或少的权重,这取决于两个因素:

The key idea being this formula is that all web pages that link to a given page X are adding to value to its "worth". By linking to some page they are "voting" in favor of this page. However this "vote" has more or less weight, depending on two factors:

  • 页面链接到X [R'(V)]的
  • 的普及
  • 在该链接到X上的页面还链接到许多其他的网页或不是事实。 [Nv的]

这两个因素反映很直观的想法:

These two factors reflect very intuitive ideas:

  • 这是一般更好地得到推荐信,从一个公认的专家在该领域不是从一个不知名的人。
  • 不管谁给的建议,也通过其他人给的建议,他们正在减少他们的建议,你的价值。

正如你注意到,这个公式是利用有些循环引用的,因为要知道X的页面范围,你需要知道的所有链接页面的PageRank的为X.​​那怎么办算起来这些的PageRank值?......那是在文档踢的部分,在那里汇聚的下一个问题解释说。

As you notice, this formula makes use of somewhat of a circular reference, because to know the page range of X, you need to know the PageRank of all pages linking to X. Then how do you figure these PageRank values?... That's where the next issue of convergence explained in the section of the document kick in.

从本质上讲,通过启动与一些随机(或preferably体面的猜测的PageRank值,所有页面,并通过计算与公式的PageRank以上,新的计算值,得到更好,因为你重复这个过程几次,值汇聚,也就是说,它们每一个越来越接近到什么是真正的/中的理论值。因此,通过迭代的次数足够量的,我们到达了片刻额外时迭代不会增加任何实际的precision由最后一次迭代中提供的值。

Essentially, by starting with some "random" (or preferably "decent guess" values of PageRank, for all pages, and by calculating the PageRank with the formula above, the new calculated values get "better", as you iterate this process a few times. The values converge, i.e. they each get closer and closer to what is the actual/theorical value. Therefore by iterating a sufficient amount of times, we reach a moment when additional iterations would not add any practical precision to the values provided by the last iteration.

现在......这是好的和花花公子,在理论上。诀窍是这个算法转换为等值的东西,但它可以更快速地完成。有迹象表明,描述如何,以及类似的任务,可以做多篇论文。我没有离手这样的引用,但会在以后添加这些。当心他们这样做将涉及线性代数的一个健康的剂量。

Now... That is nice and dandy, in theory. The trick is to convert this algorithm to something equivalent but which can be done more quickly. There are several papers that describe how this, and similar tasks, can be done. I don't have such references off-hand, but will add these later. Beware they do will involve a healthy dose of linear algebra.

编辑:作为承诺,这里有关于算法来计算网页排名的几个环节。 1999年的PageRank Haveliwala高效计算 /// 开拓网络的计算PR Kamvar等人的块结构2003 /// 快速两级算法来计算PageRank的李某等人。 2002

as promised, here are a few links regarding algorithms to calculate page rank. Efficient Computation of PageRank Haveliwala 1999 /// Exploiting the Block Structure of the Web for Computing PR Kamvar etal 2003 /// A fast two-stage algorithm for computing PageRank Lee et al. 2002

尽管许多上面提供的链接的作者是斯坦福大学,它并不需要很长时间认识到,追求高效的PageRank样的计算是研究的一个热点领域。我知道这种材料超出OP的范围,但以暗示的事实,基本的算法是不实际的大腹板是重要的。

Although many of the authors of the links provided above are from Stanford, it doesn't take long to realize that the quest for efficient PageRank-like calculation is a hot field of research. I realize this material goes beyond the scope of the OP, but it is important to hint at the fact that the basic algorithm isn't practical for big webs.

要完成一个十分便利的文本(但有许多环节要深入信息),我想提维基百科的优异文章

To finish with a very accessible text (yet with many links to in-depth info), I'd like to mention Wikipedia's excellent article

如果你是认真对待这样的事情,你可以考虑一个介绍/进修班的数学,特别是线性代数,以及计算机科学类,处理一般的图形。顺便说一句,伟大的建议,从迈克尔·多尔夫曼,在这篇文章中,对1806年的演讲OCW的视频。

If you're serious about this kind of things, you may consider an introductory/refresher class in maths, particularly linear algebra, as well a computer science class that deal with graphs in general. BTW, great suggestion from Michael Dorfman, in this post, for OCW's video of 1806's lectures.

我希望这有助于有点...

I hope this helps a bit...

这篇关于PageRank和它的数学:说明需要的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆