Get the first link in a Wikipedia article not inside parentheses


Question

So I'm interested in this theory that if you go to a random Wikipedia article, click the first link not inside parentheses repeatedly, in 95% of the cases you will end up on the article about Philosophy.

I wanted to write a script in Python that does the link fetching for me and in the end, print a nice list of which articles were visited (linkA -> linkB -> linkC) etc.

I managed to get the HTML DOM of the web pages, and managed to strip out some unnecessary links and the top description bar which leads disambiguation pages. So far I have concluded that:


  • The DOM begins with the table which you see on the right on some pages, for example in Human. We want to ignore these links.
  • The valid link elements all have a <p> element somewhere as their ancestor (most often a parent or grandparent, if the link sits inside a <b> tag or similar). The top bar which leads to disambiguation pages does not seem to contain any <p> elements.
  • Invalid links contain some special words followed by a colon, e.g. Wikipedia:
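The prefix rule in the last bullet can be sketched as a single regular expression (the namespace list here is illustrative, not exhaustive, and just mirrors the checks used in the script further down):

```python
import re

# Namespace prefixes that mark "special" wiki pages rather than articles.
# This list is illustrative; MediaWiki defines more namespaces than these.
SPECIAL_PREFIX = re.compile(
    r'^/wiki/(File|Wikipedia|Portal|Special|Help|Template_talk|Template|Talk|Category):')

def is_article_link(href):
    """True for plain article links like /wiki/Human."""
    return href.startswith('/wiki/') and not SPECIAL_PREFIX.match(href)
```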

So far, so good. But it's the parentheses that get me. In the article about Human for example, the first link not inside parentheses is "/wiki/Species", but the script finds "/wiki/Taxonomy" which is inside them.

I have no idea how to go about this programmatically, since I have to look for text in some combination of parent/child nodes which may not always be the same. Any ideas?
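One possible way to frame the check, independent of any particular DOM library, is to take the plain text of the surrounding paragraph and track parenthesis depth up to the point where the link's anchor text begins. The helper below is a hypothetical sketch and assumes the anchor text occurs verbatim in the paragraph text:

```python
def outside_parentheses(paragraph_text, anchor_text):
    """Return True if the first occurrence of anchor_text in
    paragraph_text is not enclosed in parentheses."""
    idx = paragraph_text.find(anchor_text)
    if idx == -1:
        return False  # anchor text not found at all
    depth = 0
    for ch in paragraph_text[:idx]:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth = max(0, depth - 1)  # tolerate stray closing parens
    return depth == 0
```

For the Human example above, "Homo sapiens" sits at depth 1 and would be rejected, while a link appearing after the closing parenthesis sits at depth 0 and would be accepted.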

My code can be seen below, but it's something I threw together really quickly and I'm not very proud of it. It's commented, however, so you can see my line of thought (I hope :) ).

"""Wikipedia fun"""
import urllib2
from xml.dom.minidom import parseString
import time

def validWikiArticleLinkString(href):
    """ Takes a string and returns True if it contains the substring
        '/wiki/' in the beginning and does not contain any of the
        "special" wiki pages. 
    """
    return (href.find("/wiki/") == 0
            and href.find("(disambiguation)") == -1 
            and href.find("File:") == -1 
            and href.find("Wikipedia:") == -1
            and href.find("Portal:") == -1
            and href.find("Special:") == -1
            and href.find("Help:") == -1
            and href.find("Template_talk:") == -1
            and href.find("Template:") == -1
            and href.find("Talk:") == -1
            and href.find("Category:") == -1
            and href.find("Bibcode") == -1
            and href.find("Main_Page") == -1)


if __name__ == "__main__":
    visited = []    # a list of visited links. used to avoid getting into loops

    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')] # need headers for the api

    currentPage = "Human"  # the page to start with

    while True:
        infile = opener.open('http://en.wikipedia.org/w/index.php?title=%s&printable=yes' % currentPage)
        html = infile.read()    # retrieve the contents of the wiki page we are at

        htmlDOM = parseString(html) # get the DOM of the parsed HTML
        aTags = htmlDOM.getElementsByTagName("a")   # find all <a> tags

        for tag in aTags:
            if "href" in tag.attributes.keys():         # see if we have the href attribute in the tag
                href = tag.attributes["href"].value     # get the value of the href attribute
                if validWikiArticleLinkString(href):                             # if we have one of the link types we are looking for

                    # Now come the tricky parts. We want to look for links in the main content area only,
                    # and we want the first link not in parentheses.

                    # assume the link is valid.
                    invalid = False            

                    # tables which appear to the right on the site appear first in the DOM, so we need to make sure
                    # we are not looking at a <a> tag somewhere inside a <table>.
                    pn = tag.parentNode                     
                    while pn is not None:
                        if str(pn).find("table at") >= 0:
                            invalid = True
                            break
                        else:
                            pn = pn.parentNode 

                    if invalid:     # go to next link
                        continue               

                    # Next we look at the descriptive texts above the article, if any; e.g
                    # This article is about .... or For other uses, see ... (disambiguation).
                    # These kinds of links will lead into loops so we classify them as invalid.

                    # We notice that this text does not appear to be inside a <p> block, so
                    # we dismiss <a> tags which aren't inside any <p>.
                    pnode = tag.parentNode
                    while pnode is not None:
                        if str(pnode).find("p at") >= 0:
                            break
                        pnode = pnode.parentNode
                    # If we have reached the root node, which has parentNode None, we classify the
                    # link as invalid.
                    if pnode is None:
                        invalid = True

                    if invalid:
                        continue


                    ######  this is where I got stuck:
                    # now we need to look if the link is inside parentheses. below is some junk

#                    for elem in tag.parentNode.childNodes:
#                        while elem.firstChild is not None:
#                            elem = elem.firstChild
#                        print elem.nodeValue

                    print href      # this will be the next link
                    newLink = href[6:]  # except for the /wiki/ part
                    break

        # if we have been to this link before, break the loop
        if newLink in visited:
            print "Stuck in loop."
            break
        # or if we have reached Philosophy
        elif newLink == "Philosophy":
            print "Ended up in Philosophy."
            break
        else:
            visited.append(currentPage)     # mark this currentPage as visited
            currentPage = newLink           # make the link we found the new page to fetch
            time.sleep(5)                   # sleep some to see results as debug


Answer


I found a Python script on GitHub (http://github.com/JensTimmerman/scripts/blob/master/philosophy.py) to play this game. It uses BeautifulSoup for HTML parsing, and to cope with the parentheses issue it simply removes the text between brackets before parsing the links.
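A minimal sketch of that bracket-stripping idea (my own illustration, not the code from the linked script) could look like:

```python
import re

# Matches an innermost (...) group, i.e. one with no nested parentheses.
_PARENS = re.compile(r'\([^()]*\)')

def strip_parenthesized(text):
    """Repeatedly remove innermost (...) groups so that nested
    parentheses are handled too."""
    while _PARENS.search(text):
        text = _PARENS.sub('', text)
    return text
```

Running this over the paragraph text before extracting links means the first link found is automatically outside parentheses. Applying it to raw HTML instead is cruder, since parentheses can also occur inside tag attributes.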
