Using HTMLParser in Python 3.2

Question

I have been using HTMLParser to scrape data from websites, stripping the HTML markup as I go. I'm aware of modules such as Beautiful Soup, but decided not to depend on "outside" modules. This code was supplied by Eloff in "Strip HTML from strings in Python":

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

It works in Python 3.1. However, I recently upgraded to Python 3.2.x and now get errors from the HTML parser code as written above.

My first error points to the line:

s.feed(html)

... and the error says ...

AttributeError: 'MLStripper' object has no attribute 'strict'

So, after a bit of research, I added "strict=True" to the class line, making it...

class MLStripper(HTMLParser, strict=True)

However, I then get a new error:

TypeError: type() takes 1 or 3 arguments

To see what would happen, I removed the "self" argument and left in "strict=True", which gave the error:

NameError: global name 'self' is not defined

... and I got the "I'm guessing on guesses" feeling.

I have no idea what the third argument in the class MLStripper(HTMLParser) line would be, after self and strict=True; my research didn't turn up anything enlightening.

Recommended answer

You're subclassing HTMLParser, but you aren't calling its __init__ method, so attributes that method sets up (such as strict) never exist on your instance. You need to add one line to your __init__ method:

def __init__(self):
    super().__init__()
    self.reset()
    self.fed = []

Also, for Python 3, the import line is:

from html.parser import HTMLParser

With these changes, a simple example works. Don't change the class line; it is unrelated to the error.
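For reference, here is a minimal sketch of the whole thing with both changes applied. It keeps the names from the question (MLStripper, strip_tags); the sample input string and the extra close() call, which flushes any text the parser is still buffering, are my own additions and not part of the original snippet.

```python
from html.parser import HTMLParser  # Python 3 location of the class


class MLStripper(HTMLParser):
    """Collects only the text between tags."""

    def __init__(self):
        super().__init__()  # let HTMLParser set up its own attributes first
        self.reset()
        self.fed = []

    def handle_data(self, d):
        # Called by the parser for each run of text outside of tags.
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # assumption: added here to flush any buffered trailing text
    return s.get_data()


# Sample input is illustrative only.
print(strip_tags('<p>Hello, <b>world</b>!</p>'))  # prints: Hello, world!
```

Note that the class line is still plain class MLStripper(HTMLParser). Keyword arguments placed there go to the metaclass rather than to __init__, which is why adding strict=True on that line produced the type() error.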
