Parsing XHTML using xml.etree.ElementTree
Question
I want to use xml.etree.ElementTree to parse an XHTML document in Python 3. The document contains &nbsp; entities, so I cannot use the default parser settings. I'd like to do something like this:
with urllib.request.urlopen(BASE_URL) as url:
    body = url.read()
parser = ET.XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity.update(entitydefs)
etree = ET.ElementTree()
root = etree.fromstring(body)  # fails: ElementTree instances have no fromstring() method
But fromstring is a module-level function in xml.etree.ElementTree, not a method of an ElementTree instance. How can I achieve something similar with an ElementTree instance?
Recommended answer
I ran into the same problem. The sample code in the question and in the chosen answer may have worked in the past, but it does not work in my Python 3.3 and Python 3.4 environments.
I finally got it working. Quoted from this Q&A.
Inspired by this post, we can simply prepend a DOCTYPE declaration that defines the missing entities to the incoming raw HTML content, and then ElementTree works out of the box.
This works on Python 2.6, 2.7, 3.3, and 3.4.
import xml.etree.ElementTree as ET
html = '''<html>
<div>Some reasonably well-formed HTML content.</div>
<form action="login">
<input name="foo" value="bar"/>
<input name="username"/><input name="password"/>
<div>It is not unusual to see in an HTML page.</div>
</form></html>'''
magic = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
<!ENTITY nbsp ' '>
]>''' # You can define more entities here, if needed
et = ET.fromstring(magic + html)
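
Note that the snippet above maps nbsp to an ordinary space, and its sample HTML does not actually contain an entity. As a rough sketch of how the same idea might be extended, the following builds the DOCTYPE from the standard library's html.entities.name2codepoint table (Python 3 only) and wraps the parsed root in an ElementTree instance, which is what the question asked for. The helper names build_doctype and parse_xhtml are made up for illustration, and the sketch assumes the input has no XML declaration or DOCTYPE of its own.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities XML already predefines; redeclaring them is unnecessary.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def build_doctype():
    """Return a DOCTYPE whose internal subset declares the HTML 4 named entities.

    Hypothetical helper: each entity is declared as a numeric character
    reference, so nbsp becomes U+00A0 rather than a plain space.
    """
    decls = "\n".join(
        '<!ENTITY {} "&#{};">'.format(name, code)
        for name, code in sorted(name2codepoint.items())
        if name not in PREDEFINED
    )
    return "<!DOCTYPE html [\n" + decls + "\n]>\n"


def parse_xhtml(text):
    """Parse XHTML text that has no DOCTYPE or XML declaration of its own."""
    return ET.fromstring(build_doctype() + text)


sample = "<html><body><p>a&nbsp;b &copy; caf&eacute;</p></body></html>"
root = parse_xhtml(sample)

# ET.fromstring returns an Element; wrap it to get an ElementTree instance.
tree = ET.ElementTree(root)
text = root.find("body/p").text
print(repr(text))
```

If the incoming document already carries its own DOCTYPE, prepending a second one would make it ill-formed, so that case would need the existing declaration stripped or replaced first.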