How to speed up a program that finds the shortest path between two Wikipedia articles


Problem description

I recently coded a program that finds the shortest path between two Wikipedia articles. The problem is that getting ALL the links from a page and putting them into a graph takes a long time. Finding the path is the easy part. Basically what I'm doing is this:

    import networkx as nx
    import pages  # the module that contains the import_page() function shown below

    startingPage = 'Lisbon'
    target = 'Adolf Hitler'
    graph = nx.DiGraph()
    graph.add_node(startingPage)
    found = pages.import_page(graph, startingPage, target)

    # Keep expanding pages whose links have not been fetched yet (out-degree 0)
    # until import_page() reports that the target article has been reached.
    while found != True:
        for node in list(graph):
            if graph.out_degree(node) == 0:
                found = pages.import_page(graph, node, target)
            if found == True:
                break
    

And my import_page function is this one:

    import json
    import urllib.request as url      # stdlib HTTP client
    from urllib.parse import quote    # URL-encode article titles and continue tokens

    def import_page(graph, starting, target):
        general_str = 'https://en.wikipedia.org/w/api.php?action=query&prop=links&pllimit=max&format=json&titles='
        data_str = general_str + quote(starting)
        response = url.urlopen(data_str)
        data = json.loads(response.read())
        pageId = list(data['query']['pages'].keys())
        print(starting)  # progress output
        if pageId[0] == '-1':  # the page doesn't exist in Wikipedia
            return False
        elif 'links' not in data['query']['pages'][pageId[0]]:  # the page has no links in it
            return False

        for jsonObject in data['query']['pages'][pageId[0]]['links']:
            graph.add_node(jsonObject['title'])
            graph.add_edge(starting, jsonObject['title'])
            if jsonObject['title'] == target:
                return True

        # prop=links returns at most pllimit links per request, so follow the
        # 'plcontinue' token until the API reports the batch is complete.
        while 'batchcomplete' not in data:
            continueId = data['continue']['plcontinue']
            continue_str = data_str + '&plcontinue=' + quote(continueId)
            response = url.urlopen(continue_str)
            data = json.loads(response.read())

            for jsonObject in data['query']['pages'][pageId[0]]['links']:
                graph.add_node(jsonObject['title'])
                graph.add_edge(starting, jsonObject['title'])
                if jsonObject['title'] == target:
                    return True

        return False
    

The problem is that for any distance bigger than 2-3 links it takes an immense amount of time. Any ideas on how I can speed it up?

Solution

I used an approach @Tgr pointed out, exploiting the small-world structure of the network. If you use a weighted network, you can limit the search to a subgraph sufficiently large to encompass the relevant hubs, yet small enough to be handled in a web RESTful API.
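To illustrate the idea (this is a sketch of my own, not the answerer's code: the graph G, its 'weight' attribute and the radius parameter are all assumptions), the weighted search can be restricted to the union of bounded neighbourhoods around the two endpoints:

    import networkx as nx

    def shortest_path_in_subgraph(G, source, target, radius=2, weight='weight'):
        # Sketch only: assumes G is an already-built weighted nx.DiGraph of article links.
        # Collect nodes within `radius` hops going out from the source and
        # within `radius` hops leading into the target (reverse direction).
        around_source = set(nx.ego_graph(G, source, radius=radius).nodes())
        around_target = set(nx.ego_graph(G.reverse(copy=False), target, radius=radius).nodes())
        # Run the weighted shortest-path search only on the much smaller subgraph.
        sub = G.subgraph(around_source | around_target)
        return nx.shortest_path(sub, source, target, weight=weight)

The radius plays the role of the "sufficiently large / small enough" trade-off: a larger radius includes more hubs but costs more memory and time.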

You may want to check out the igraph module rather than networkx, for a smaller memory footprint.
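For comparison, here is a minimal igraph sketch (the vertex names, edges and weights below are made up for illustration; only the igraph calls themselves matter):

    import igraph as ig

    # Toy weighted, directed graph standing in for a Wikipedia link subgraph.
    g = ig.Graph(directed=True)
    g.add_vertices(['Lisbon', 'Portugal', 'Europe', 'Adolf Hitler'])
    g.add_edges([('Lisbon', 'Portugal'), ('Portugal', 'Europe'), ('Europe', 'Adolf Hitler')])
    g.es['weight'] = [1.0, 2.0, 1.5]

    # Weighted shortest path between two articles, returned as vertex indices.
    path = g.get_shortest_paths('Lisbon', to='Adolf Hitler', weights='weight', output='vpath')[0]
    print([g.vs[i]['name'] for i in path])

igraph stores vertices and edges in compact C data structures, which is where the memory saving over networkx's dict-of-dicts representation comes from.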

With the approach I suggested to you, I have been able to obtain shortest paths connecting up to 5 queried Wikipedia articles, with a memory footprint of up to 100 MB for the sub-graph created in real time. A shortest path between two topics takes less than 1 second.

I would be happy to share a link to my project, which actually computes a weighted knowledge network for Wikipedia to allow searching for connections between multiple topics - would it break SO policy, or could it be useful for the OP and the discussion of his question?

EDIT

Thank you @Tgr for clarifying the policy.

Nifty.works is a prototype platform to search for connections between inter-disciplinary fields. The knowledge graph is a subset of Wikidata paired with English Wikipedia.

As an example for the OP, this example shows shortest paths queried between five Wikipedia articles, as a subgraph of the connections between the articles "Shortest Path Problem", "A star search", "networkx", "knowledge graph" and "semantic network".

I computed the knowledge graph of Wikipedia as a weighted network. The network has small-world properties. A query for connections (paths) between articles is made by delimiting a portion of the knowledge graph (a sub-graph).
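The answer does not say how the edge weights are computed; purely as an illustrative assumption, one simple scheme is to make edges into highly linked pages cheaper to traverse, so that shortest paths naturally pass through the small-world hubs:

    import math

    def add_degree_based_weights(G):
        # Hypothetical weighting (not the answerer's actual scheme): the more
        # incoming links a page has, the cheaper it is to route through it.
        for u, v in G.edges():
            G[u][v]['weight'] = 1.0 / (1.0 + math.log(1 + G.in_degree(v)))

Any such weighting can then be fed to the sub-graph search sketched earlier via its weight parameter.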

With this approach it is possible to serve graph searches fast enough to provide insights for knowledge discovery, even on small server machines.

Here you find examples of gamification of shortest paths between two articles of English Wikipedia, where each pair has a distance greater than 3 links - that is, they are not first neighbours: e.g. "Machine Learning" and "Life" (here is a json of the queried subgraph).

You might even want to add parameters to adjust the size of your weighted sub-graph, so as to obtain different results (a small usage sketch follows the list below). As an example, see the differences between:

  • "Machine Learning" - "Life": shortest path queried on a weighted sub-graph (1) of the English Wikipedia knowledge graph (a small-world network)

  • "Machine Learning" - "Life": shortest path queried on a weighted sub-graph (2) of the English Wikipedia knowledge graph (a small-world network)
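Continuing the earlier sketch (same assumed graph G and hypothetical shortest_path_in_subgraph helper), such a size parameter could simply be the neighbourhood radius:

    # Different sub-graph sizes can produce different shortest paths.
    path_small = shortest_path_in_subgraph(G, 'Machine learning', 'Life', radius=1)
    path_large = shortest_path_in_subgraph(G, 'Machine learning', 'Life', radius=2)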

Finally, also look at this question: https://stackoverflow.com/a/16030045/305883
