Parsing XHTML using xml.etree.ElementTree

Question

I want to use xml.etree.ElementTree to parse an XHTML document in Python 3. The document contains &nbsp; entities, which the default parser settings reject as undefined, so I cannot use them as they are. I'd like to do something similar to:

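import urllib.request
import xml.etree.ElementTree as ET
from html.entities import entitydefs  # assumed source of entitydefs

with urllib.request.urlopen(BASE_URL) as url:
    body = url.read()
    parser = ET.XMLParser()
    parser.parser.UseForeignDTD(True)
    parser.entity.update(entitydefs)
    etree = ET.ElementTree()
    # This is the line that fails: an ElementTree instance has no fromstring method
    root = etree.fromstring(body)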

But fromstring is a module-level function in ElementTree, not a method of an ElementTree instance. How can I achieve something similar with an ElementTree instance?
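
To make the underlying failure concrete, here is a minimal sketch of what happens when the default parser meets &nbsp;. The sample string is made up for illustration and is not taken from the document in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input: well-formed XHTML apart from the HTML-only entity
snippet = "<html><body><p>a&nbsp;b</p></body></html>"

try:
    ET.fromstring(snippet)
    error = None
except ET.ParseError as exc:
    # XML predefines only lt, gt, amp, quot and apos, so nbsp is undefined here
    error = exc

print(error)
```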

Recommended answer

I ran into the same problem. The sample code in the question and in the chosen answer may have worked in the past, but it does not work in my Python 3.3 and Python 3.4 environments.

I finally got it working. The following is quoted from this Q&A.

Inspired by this post, we can simply prepend a DOCTYPE declaration that defines the missing entities to the incoming raw HTML content, and ElementTree will then parse it out of the box.

This works on Python 2.6, 2.7, 3.3, and 3.4.

import xml.etree.ElementTree as ET

html = '''<html>
    <div>Some reasonably well-formed HTML content.</div>
    <form action="login">
    <input name="foo" value="bar"/>
    <input name="username"/><input name="password"/>

    <div>It is not unusual to see &nbsp; in an HTML page.</div>

    </form></html>'''

magic = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
            <!ENTITY nbsp ' '>
            ]>'''  # You can define more entities here, if needed

et = ET.fromstring(magic + html)
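
If more than a handful of entities are needed, the declarations could be generated instead of typed by hand. The following is only a sketch under some assumptions: it is Python 3 only (it relies on html.entities.name2codepoint from the standard library), and build_doctype and parse_xhtml are hypothetical helper names introduced here, not part of ElementTree.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities that XML already predefines; they are skipped rather than redeclared
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}


def build_doctype():
    """Return a DOCTYPE whose internal subset declares the HTML named entities."""
    declarations = "".join(
        '<!ENTITY %s "&#%d;">' % (name, codepoint)
        for name, codepoint in sorted(name2codepoint.items())
        if name not in XML_PREDEFINED
    )
    return "<!DOCTYPE html [%s]>" % declarations


def parse_xhtml(text):
    """Parse markup that has no DOCTYPE or XML declaration of its own."""
    return ET.fromstring(build_doctype() + text)


root = parse_xhtml("<html><body><p>a&nbsp;b &copy; &eacute;</p></body></html>")
paragraph = root.find("./body/p")
print(repr(paragraph.text))
```

As with the hand-written version, the prepended DOCTYPE has to be the first thing in the input. A document that already starts with an XML declaration or carries its own DOCTYPE would need that prefix handled before using this approach.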
