Have HTMLParser differentiate between link-text and other data?

Question

Say I have HTML similar to this:

<a href="http://example.org/">Stuff I do want</a>
<p>Stuff I don't want</p>

HTMLParser's handle_data doesn't differentiate between the link text (the stuff I do want; is that even the right term?) and the stuff I don't want. Does HTMLParser have a built-in way to have handle_data called only for link text and nothing else?

Recommended answer

Basically you have to write a handle_starttag() method as well. Save each tag you see in an attribute such as self.lasttag. Then, in your handle_data() method, check whether self.lasttag is 'a', which indicates that the last start tag you saw was an HTML anchor tag and that you are therefore probably inside a link.

Something like this (untested) should work:

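from html.parser import HTMLParser  # Python 3; in Python 2 the module was HTMLParser

class MyHTMLParser(HTMLParser):

    lasttag = None

    def handle_starttag(self, tag, attr):
        # Remember the most recent start tag.
        self.lasttag = tag.lower()

    def handle_data(self, data):
        # Print non-blank text that follows an <a> start tag. Text after
        # </a> is also printed until the next start tag replaces lasttag.
        if self.lasttag == "a" and data.strip():
            print(data)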

In fact, HTML allows other tags inside an <a ...> ... </a> container, and there can also be anchors that contain text but aren't links (no href attribute). Both cases can be handled if desired. Again, this code is untested:

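from html.parser import HTMLParser  # Python 3; in Python 2 the module was HTMLParser

class MyHTMLParser(HTMLParser):

    inlink = False
    data   = []

    def handle_starttag(self, tag, attr):
        # Start collecting only for anchors that carry an href attribute.
        if tag.lower() == "a" and "href" in (k.lower() for k, v in attr):
            self.inlink = True
            self.data   = []

    def handle_endtag(self, tag):
        # Print only when a link is open, so </a> of a non-link anchor is ignored.
        if tag.lower() == "a" and self.inlink:
            self.inlink = False
            print("".join(self.data))

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) is collected too.
        if self.inlink:
            self.data.append(data)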
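
For anyone who wants to try the idea end to end, here is a minimal Python 3 sketch of the same approach that collects link texts into a list instead of printing them. The class name LinkTextCollector and the attributes links and _parts are illustrative names, not part of the original answer, and the third anchor in the sample input is made up to show the no-href case.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the text of every <a href=...> element into self.links."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished link texts, in document order
        self._parts = None   # text chunks of the link being read, or None

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names already lowercased.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self._parts = []

    def handle_endtag(self, tag):
        # Only close a link that was actually opened (anchor with href).
        if tag == "a" and self._parts is not None:
            self.links.append("".join(self._parts))
            self._parts = None

    def handle_data(self, data):
        # Text inside nested tags such as <b> is collected as well.
        if self._parts is not None:
            self._parts.append(data)


parser = LinkTextCollector()
parser.feed('<a href="http://example.org/">Stuff I <b>do</b> want</a>\n'
            "<p>Stuff I don't want</p>\n"
            '<a name="top">An anchor, not a link</a>')
parser.close()
print(parser.links)  # ['Stuff I do want']
```

Keeping the state on the instance (set in __init__) rather than on the class avoids sharing a mutable list between parser objects.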

HTMLParser is what you'd call a SAX-style parser: it notifies you of each tag as it goes by but leaves you to keep track of the tag hierarchy yourself. The differences between the first and second versions above show how quickly that bookkeeping gets complicated.
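
To make the "notifies you of the tags going by" point concrete, the following small sketch records the callbacks in the order the parser fires them. EventLogger and its events list are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record parser callbacks as (kind, value) tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text to keep the trace short.
        if data.strip():
            self.events.append(("data", data))


logger = EventLogger()
logger.feed('<a href="http://example.org/">Stuff I <b>do</b> want</a>')
logger.close()
for event in logger.events:
    print(event)
```

The trace is a flat sequence of start, data, and end events. Nothing in it says that "do" sits inside both the b and the a element; that nesting is exactly the state your own code has to track.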

DOM-style parsers are easier to work with for these kinds of tasks because they read the whole document into memory and produce a tree that is easily navigated and searched. They tend to use more memory and run slower than SAX-style parsers, but this matters much less now than it did ten years ago.
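
As a rough illustration of the tree-based approach, the sketch below uses the standard library's xml.etree.ElementTree. This is an assumption made for the sake of a dependency-free example: ElementTree is an XML parser, so it only accepts well-formed markup (the sample is wrapped in a made-up div root), and real-world HTML would usually need a more lenient tree-building parser.

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup; the <div> wrapper and the second anchor
# are additions for this illustration.
markup = (
    "<div>"
    '<a href="http://example.org/">Stuff I <b>do</b> want</a>'
    "<p>Stuff I don't want</p>"
    '<a name="top">An anchor, not a link</a>'
    "</div>"
)

root = ET.fromstring(markup)

# Walk the tree: every <a> that has an href, with the text of all descendants.
link_texts = [
    "".join(a.itertext())
    for a in root.iter("a")
    if "href" in a.attrib
]
print(link_texts)  # ['Stuff I do want']
```

With the tree already built, the nested b element and the anchor without href need no state flags at all; the query is a single comprehension.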
