Get the first link in a Wikipedia article not inside parentheses


Question

So I'm interested in this theory that if you go to a random Wikipedia article, click the first link not inside parentheses repeatedly, in 95% of the cases you will end up on the article about Philosophy.

I wanted to write a script in Python that does the link fetching for me and in the end, print a nice list of which articles were visited (linkA -> linkB -> linkC) etc.

I managed to get the HTML DOM of the web pages, and managed to strip out some unnecessary links and the top description bar which leads disambiguation pages. So far I have concluded that:


  • The DOM begins with the table which you see on the right on some pages, for example in Human. We want to ignore these links.
  • The valid link elements all have a <p> element somewhere as their ancestor (most often a parent or grandparent, if the link sits inside a <b> tag or similar). The top bar which leads to disambiguation pages does not seem to contain any <p> elements.
  • Invalid links contain some special words followed by a colon, e.g. Wikipedia:
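The prefix rule in the last bullet can be sketched as a single regular expression (the namespace list here is illustrative, not exhaustive, and just mirrors the checks used in the script further down):

```python
import re

# Namespace prefixes that mark "special" wiki pages rather than articles.
# This list is illustrative; MediaWiki defines more namespaces than these.
SPECIAL_PREFIX = re.compile(
    r'^/wiki/(File|Wikipedia|Portal|Special|Help|Template_talk|Template|Talk|Category):')

def is_article_link(href):
    """True for plain article links like /wiki/Human."""
    return href.startswith('/wiki/') and not SPECIAL_PREFIX.match(href)
```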

So far, so good. But it's the parentheses that get me. In the article about Human for example, the first link not inside parentheses is "/wiki/Species", but the script finds "/wiki/Taxonomy" which is inside them.

I have no idea how to go about this programmatically, since I have to look for text in some combination of parent/child nodes which may not always be the same. Any ideas?
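One possible way to frame the check, independent of any particular DOM library, is to take the plain text of the surrounding paragraph and track parenthesis depth up to the point where the link's anchor text begins. The helper below is a hypothetical sketch and assumes the anchor text occurs verbatim in the paragraph text:

```python
def outside_parentheses(paragraph_text, anchor_text):
    """Return True if the first occurrence of anchor_text in
    paragraph_text is not enclosed in parentheses."""
    idx = paragraph_text.find(anchor_text)
    if idx == -1:
        return False  # anchor text not found at all
    depth = 0
    for ch in paragraph_text[:idx]:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth = max(0, depth - 1)  # tolerate stray closing parens
    return depth == 0
```

For the Human example above, "Homo sapiens" sits at depth 1 and would be rejected, while a link appearing after the closing parenthesis sits at depth 0 and would be accepted.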

My code can be seen below, but it's something I threw together really quickly and I'm not very proud of it. It's commented, however, so you can see my line of thought (I hope :) ).

"""Wikipedia fun"""
import urllib2
from xml.dom.minidom import parseString
import time

def validWikiArticleLinkString(href):
    """ Takes a string and returns True if it contains the substring
        '/wiki/' in the beginning and does not contain any of the
        "special" wiki pages. 
    """
    return (href.find("/wiki/") == 0
            and href.find("(disambiguation)") == -1 
            and href.find("File:") == -1 
            and href.find("Wikipedia:") == -1
            and href.find("Portal:") == -1
            and href.find("Special:") == -1
            and href.find("Help:") == -1
            and href.find("Template_talk:") == -1
            and href.find("Template:") == -1
            and href.find("Talk:") == -1
            and href.find("Category:") == -1
            and href.find("Bibcode") == -1
            and href.find("Main_Page") == -1)


if __name__ == "__main__":
    visited = []    # a list of visited links. used to avoid getting into loops

    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')] # need headers for the api

    currentPage = "Human"  # the page to start with

    while True:
        infile = opener.open('http://en.wikipedia.org/w/index.php?title=%s&printable=yes' % currentPage)
        html = infile.read()    # retrieve the contents of the wiki page we are at

        htmlDOM = parseString(html) # get the DOM of the parsed HTML
        aTags = htmlDOM.getElementsByTagName("a")   # find all <a> tags

        for tag in aTags:
            if "href" in tag.attributes.keys():         # see if we have the href attribute in the tag
                href = tag.attributes["href"].value     # get the value of the href attribute
                if validWikiArticleLinkString(href):                             # if we have one of the link types we are looking for

                    # Now come the tricky parts. We want to look for links in the main content area only,
                    # and we want the first link not in parentheses.

                    # assume the link is valid.
                    invalid = False            

                    # tables which appear to the right on the site appear first in the DOM, so we need to make sure
                    # we are not looking at a <a> tag somewhere inside a <table>.
                    pn = tag.parentNode                     
                    while pn is not None:
                        if str(pn).find("table at") >= 0:
                            invalid = True
                            break
                        else:
                            pn = pn.parentNode 

                    if invalid:     # go to next link
                        continue               

                    # Next we look at the descriptive texts above the article, if any; e.g
                    # This article is about .... or For other uses, see ... (disambiguation).
                    # These kinds of links will lead into loops so we classify them as invalid.

                    # We notice that this text does not appear to be inside a <p> block, so
                    # we dismiss <a> tags which aren't inside any <p>.
                    pnode = tag.parentNode
                    while pnode is not None:
                        if str(pnode).find("p at") >= 0:
                            break
                        pnode = pnode.parentNode
                    # If we have reached the root node, which has parentNode None, we classify the
                    # link as invalid.
                    if pnode is None:
                        invalid = True

                    if invalid:
                        continue


                    ######  this is where I got stuck:
                    # now we need to look if the link is inside parentheses. below is some junk

#                    for elem in tag.parentNode.childNodes:
#                        while elem.firstChild is not None:
#                            elem = elem.firstChild
#                        print elem.nodeValue

                    print href      # this will be the next link
                    newLink = href[6:]  # except for the /wiki/ part
                    break

        # if we have been to this link before, break the loop
        if newLink in visited:
            print "Stuck in loop."
            break
        # or if we have reached Philosophy
        elif newLink == "Philosophy":
            print "Ended up in Philosophy."
            break
        else:
            visited.append(currentPage)     # mark this currentPage as visited
            currentPage = newLink           # make the link we found the new page to fetch
            time.sleep(5)                   # sleep some to see results as debug


Answer


I found a Python script on GitHub (http://github.com/JensTimmerman/scripts/blob/master/philosophy.py) to play this game. It uses BeautifulSoup for HTML parsing, and to cope with the parentheses issue it simply removes the text between brackets before parsing the links.
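A minimal sketch of that bracket-stripping idea (my own illustration, not the code from the linked script) could look like:

```python
import re

# Matches an innermost (...) group, i.e. one with no nested parentheses.
_PARENS = re.compile(r'\([^()]*\)')

def strip_parenthesized(text):
    """Repeatedly remove innermost (...) groups so that nested
    parentheses are handled too."""
    while _PARENS.search(text):
        text = _PARENS.sub('', text)
    return text
```

Running this over the paragraph text before extracting links means the first link found is automatically outside parentheses. Applying it to raw HTML instead is cruder, since parentheses can also occur inside tag attributes.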
