How to parse HTML with entities such as &nbsp; using builtin library ElementTree in Python 2 & Python 3?

Problem description

There are times when you want to parse reasonably well-formed HTML pages but are reluctant to introduce an extra library dependency such as BeautifulSoup or lxml. So you will probably want to try the builtin ElementTree first, because it is part of the standard library, it is fast (implemented in C), and it offers a much better interface (such as limited XPath support) than the basic HTMLParser. Not to mention, HTMLParser has its own limitations.

ElementTree will work until it encounters an entity such as &nbsp;, which is not handled by default because it is an HTML entity rather than one of the entities predefined by XML.

import xml.etree.ElementTree as ET

html = '''<html>
    <div>Some reasonably well-formed HTML content.</div>
    <form action="login">
    <input name="foo" value="bar"/>
    <input name="username"/><input name="password"/>

    <div>It is not unusual to see &nbsp; in an HTML page.</div>

    </form></html>'''
et = ET.fromstring(html)

Run it on Python 2 or Python 3 and you will see this error:

xml.etree.ElementTree.ParseError: undefined entity: line 7, column 38
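
To make the failure mode concrete, here is a minimal sketch (Python 3 syntax, with hypothetical variable names `ok` and `message`) contrasting the five entities that XML predefines with an HTML-only entity. It assumes nothing beyond the standard library, and the exact line and column in the message depend on the input.

```python
import xml.etree.ElementTree as ET

# XML predefines only five entities; these parse without any DOCTYPE.
ok = ET.fromstring('<div>&amp; &lt; &gt; &quot; &apos;</div>')
print(ok.text)

# &nbsp; is defined by HTML, not by XML, so the parser rejects it.
try:
    ET.fromstring('<div>a&nbsp;b</div>')
    message = None
except ET.ParseError as err:
    # The error is an ordinary exception and can be caught and inspected.
    message = str(err)
print(message)
```

The point is that the parser is not broken: it simply has no declaration for nbsp, which is why supplying one fixes the problem.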

There are some Q&As out there, such as this one and that one. They suggest using ElementTree.XMLParser().parser.UseForeignDTD(True), but I cannot get it to work in Python 3.3 or Python 3.4:

$ python3.3
Python 3.3.5 (v3.3.5:62cf4e77f785, Mar  9 2014, 01:12:57) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.etree.ElementTree as ET
>>> ET.XMLParser().parser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'xml.etree.ElementTree.XMLParser' object has no attribute 'parser'
>>> 

Solution

Inspired by this post, we can simply prepend a DOCTYPE declaration that defines the missing entities to the incoming raw HTML content, and then ElementTree works out of the box.

This works on Python 2.6, 2.7, 3.3, and 3.4.

import xml.etree.ElementTree as ET

html = '''<html>
    <div>Some reasonably well-formed HTML content.</div>
    <form action="login">
    <input name="foo" value="bar"/>
    <input name="username"/><input name="password"/>

    <div>It is not unusual to see &nbsp; in an HTML page.</div>

    </form></html>'''

magic = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
            <!ENTITY nbsp ' '>
            ]>'''  # You can define more entities here, if needed

et = ET.fromstring(magic + html)
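
If the pages use many different entities, writing each declaration by hand gets tedious. The following is a sketch of one way to generate the declarations from the standard library's entity table instead. It is an extension of the idea above, not part of the original answer: the names `build_doctype`, `parse_html_fragment`, and `XML_PREDEFINED` are hypothetical, and the import is the Python 3 one (on Python 2 the same table lives in `htmlentitydefs`).

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint  # Python 2: from htmlentitydefs import name2codepoint

# Entities XML already defines; they are skipped rather than redeclared.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def build_doctype(root_tag="html"):
    """Return a DOCTYPE whose internal subset declares the HTML named entities."""
    decls = "".join(
        '<!ENTITY %s "&#%d;">' % (name, codepoint)
        for name, codepoint in sorted(name2codepoint.items())
        if name not in XML_PREDEFINED
    )
    return "<!DOCTYPE %s [%s]>" % (root_tag, decls)


def parse_html_fragment(markup):
    """Parse well-formed markup after prepending the generated DOCTYPE.

    Assumes the markup does not start with its own XML declaration or
    DOCTYPE, since the generated one must come first.
    """
    return ET.fromstring(build_doctype() + markup)


root = parse_html_fragment('<html><div>a&nbsp;b &copy; 2014</div></html>')
print(repr(root.find("div").text))
```

Each entity is declared as a numeric character reference, so `&nbsp;` ends up as U+00A0 in the parsed text. The input still has to be well-formed XML; this only removes the undefined-entity error.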
