How to use Python's HTMLParser to extract specific links


Question

I've been working on a basic web crawler in Python using the HTMLParser class. I collect my links with a modified handle_starttag method that looks like this:

def handle_starttag(self, tag, attrs):
    if tag == 'a':
        for (key, value) in attrs:
            if key == 'href':
                newUrl = urljoin(self.baseUrl, value)
                self.links = self.links + [newUrl]

This worked very well when I wanted to find every link on the page. Now I only want to fetch certain links.

How would I go about only fetching links that are between the <td class="title"> and </td> tags, like this:

<td class="title"><a href="http://www.stackoverflow.com">StackOverflow</a><span class="comhead"> (arstechnica.com) </span></td>


Recommended answer

HTMLParser is a SAX-style or streaming parser, which means that you get pieces of the document as they are parsed, but not the whole document at once. The parser calls methods you provide to handle tags and other types of data. Any context you may be interested in, such as which tags are inside other tags, you must glean yourself from the tags you see passing by.

For example, if you see a <td> tag, then you know you are in a table cell, and can set a flag to that effect. When you see </td>, you know you have left a table cell and can clear that flag. To get the links inside a table cell, then, if you see <a> and you know that you are in a table cell (because of that flag you set), you grab the value of the tag's href attribute if it has one.

from html.parser import HTMLParser  # Python 3; on Python 2 use "from HTMLParser import HTMLParser"

class LinkExtractor(HTMLParser):

    def reset(self):
        # reset() is called by HTMLParser.__init__, so this state exists on every new instance
        HTMLParser.reset(self)
        self.extracting = False   # True while inside <td class="title">
        self.links      = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" or tag == "a":
            attrs = dict(attrs)   # save us from iterating over the attrs
        if tag == "td" and attrs.get("class", "") == "title":
            self.extracting = True
        elif tag == "a" and "href" in attrs and self.extracting:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td":
            self.extracting = False   # left the cell, stop collecting
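
To see the flag technique run end to end, here is a minimal sketch that combines it with the urljoin call from the question and feeds it a row like the one in the question. The class name TitleLinkExtractor, the base_url parameter, the example.com base URL, and the two extra cells in the sample markup are assumptions made for illustration, not part of the original code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleLinkExtractor(HTMLParser):
    """Collect hrefs of <a> tags that appear inside <td class="title">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url      # hypothetical: the URL the page was fetched from
        self.in_title_cell = False    # the flag described in the answer
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            # Entering a cell: remember whether it is a title cell
            self.in_title_cell = attrs.get("class") == "title"
        elif tag == "a" and self.in_title_cell and attrs.get("href"):
            # Resolve relative links against the base URL, as in the question
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_title_cell = False


# Sample markup: the first cell is from the question, the other two are made up
sample = (
    '<table><tr>'
    '<td class="title"><a href="http://www.stackoverflow.com">StackOverflow</a>'
    '<span class="comhead"> (arstechnica.com) </span></td>'
    '<td class="subtext"><a href="/item?id=1">comments</a></td>'
    '<td class="title"><a href="/newest">More</a></td>'
    '</tr></table>'
)

parser = TitleLinkExtractor("http://example.com/news")
parser.feed(sample)
parser.close()
print(parser.links)
```

Only the links in the two title cells are collected; the link in the "subtext" cell is skipped because the flag is off when its <a> tag arrives. A single boolean is enough here because table cells in this markup are not nested; nested tables would need a counter or a stack instead.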

This quickly gets to be a pain as you need more and more context to get what you want from the document, which is why people are recommending lxml and BeautifulSoup. These are DOM-style parsers that keep track of the document hierarchy for you and provide various friendly ways to navigate it, such as a DOM API, XPath, and/or CSS selectors.
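
For a rough feel of the tree-based style, here is a sketch that uses the standard library's xml.etree.ElementTree as a stand-in. This is an assumption made to keep the example dependency-free: ElementTree is an XML parser and only works here because the sample row happens to be well-formed, whereas lxml and BeautifulSoup are built to cope with real-world HTML. The second cell in the sample is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The first cell is from the question; the second is a made-up non-title cell
row = (
    '<tr>'
    '<td class="title"><a href="http://www.stackoverflow.com">StackOverflow</a>'
    '<span class="comhead"> (arstechnica.com) </span></td>'
    '<td class="subtext"><a href="http://example.com/item">comments</a></td>'
    '</tr>'
)

root = ET.fromstring(row)

# The path expression states the nesting directly: any <a> with an href
# anywhere below a <td> whose class attribute is exactly "title"
title_links = [
    a.get("href")
    for a in root.findall(".//td[@class='title']//a[@href]")
]
print(title_links)
```

The parser builds the whole tree first, so the question "is this link inside a title cell?" becomes a single query with no flags to set or clear.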

BTW, I answered a similar question recently here.
