实现用麻preduce的PageRank [英] Implementing PageRank using MapReduce

查看:171
本文介绍了实现用麻preduce的PageRank的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图让我的头围绕着一个问题,实施与马preduce PageRank的理论。

I'm trying to get my head around an issue with the theory of implementing the PageRank with MapReduce.

我有以下简单的场景有三个节点:ABç

I have the following simple scenario with three nodes: A B C.

邻接矩阵是在这里:

A { B, C }
B { A }

中的PageRank为B为例等于:

The PageRank for B for example is equal to:

(1-d)/N + d ( PR(A) / C(A) ) 

N     = number of incoming links to B
PR(A) = PageRank of incoming link A
C(A)  = number of outgoing links from page A

我很好的所有图表,以及如何映射器和减速会工作,但我不能让我的头围绕如何在计算由减速的时候,C(A)将是已知的。如何将减速机,通过聚合导入链接到B就知道每个页面导出链接数量计算B的PageRank的时候。这是否需要一些外部数据源的查找?

I am fine with all the schematics and how the mapper and reducer would work but I cannot get my head around how at the time of calculation by the reducer, C(A) would be known. How will the reducer, when calculating the PageRank of B by aggregating the incoming links to B will know the number of outgoing links from each page. Does this require a lookup in some external data source?

推荐答案

下面是一个伪code:

Here is a pseudocode:

map( key: [url, pagerank], value: outlink_list )
    for each outlink in outlink_list
        emit( key: outlink, value: pagerank/size(outlink_list) )

    emit( key: url, value: outlink_list )

reducer( key: url, value: list_pr_or_urls )
    outlink_list = []
    pagerank = 0

    for each pr_or_urls in list_pr_or_urls
        if is_list( pr_or_urls )
            outlink_list = pr_or_urls
        else
            pagerank += pr_or_urls

    pagerank = 1 - DAMPING_FACTOR + ( DAMPING_FACTOR * pagerank )

    emit( key: [url, pagerank], value: outlink_list )

这在降低你应该输出outlinks和不反向链接,作为对intenret一些文章建议是重要的。以此方式,连续迭代也将有outlinks作为mapper的输入。

It is important that in the reduce you should output outlinks and not inlinks, as some articles on the intenret suggests. This way the consecutive iterations will also have outlinks as input of the mapper.

注重多个outlinks与来自相同页数为一体的同一地址。此外,不计循环(页链接到自己)。

Pay attention that multiple outlinks with the same address from the same page count as one. Also, don't count loops (page linking to itself).

阻尼因素是传统的0.85,虽然可以玩弄其他值了。

The damping factor is traditionally 0.85, although you can play around with other values, too.

这篇关于实现用麻preduce的PageRank的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆