Parse XHTML5 with undefined entities


Question

Consider this:

import xml.etree.ElementTree as ET

xhtml = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
        <html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
        <head><title>XHTML sample</title></head>
            <body>
                <p>&nbsp;Sample text</p>
            </body>
        </html>
'''
parser = ET.XMLParser()
parser.entity['nbsp'] = '&#x00A0;'
tree = ET.fromstring(xhtml, parser=parser)
print(ET.tostring(tree, method='xml'))

which renders a nice text representation of the xhtml string.

But for the same XHTML document with an HTML5 doctype:

xhtml = '''<!DOCTYPE html>
        <html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
        <head><title>XHTML sample</title></head>
            <body>
                <p>&nbsp;Sample text</p>
            </body>
        </html>
'''

I get an exception:

xml.etree.ElementTree.ParseError: undefined entity: line 5, column 19

so the parser can't handle it, even though I added nbsp to the entity dict.

If I use lxml:

from lxml import etree
parser = etree.XMLParser(resolve_entities=False)
tree = etree.fromstring(xhtml, parser=parser)
print(etree.tostring(tree, method='xml'))

it raises:

lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 5, column 26

although I've set the parser to ignore entities.

Why is this, and how can I parse XHTML files that have an HTML5 doctype declaration?

A partial solution for lxml is to use the recover option:

parser = etree.XMLParser(resolve_entities=False, recover=True)

but I'm still waiting for a better one.

Recommended answer

The problem is that the Expat parser used behind the scenes does not normally report unknown entities. It raises an error instead, so the fallback code in xml.etree.ElementTree that you were trying to trigger never runs. You can change this behavior with the UseForeignDTD method: it makes Expat ignore the doctype declaration and hand every undeclared entity reference to xml.etree.ElementTree. The following code works correctly:

import xml.etree.ElementTree as ET

xhtml = '''<!DOCTYPE html>
        <html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
        <head><title>XHTML sample</title></head>
            <body>
                <p>&nbsp;Sample text</p>
            </body>
        </html>
'''
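parser = ET.XMLParser()
# Make Expat report undeclared entities instead of raising an error.
# Note: _parser is a private attribute, so it may not exist in every Python version.
parser._parser.UseForeignDTD(True)
# The value must be the replacement text itself, not another entity reference.
parser.entity['nbsp'] = u'\u00A0'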
tree = ET.fromstring(xhtml, parser=parser)
print(ET.tostring(tree, method='xml'))
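To see the Expat behavior in isolation, here is a minimal sketch that uses the public xml.parsers.expat module directly rather than ElementTree. The helper name skipped_entities and the shortened sample document are made up for illustration. Without UseForeignDTD the undeclared entity is a fatal error; with it, Expat reports the entity name to a handler and keeps parsing.

```python
import xml.parsers.expat as expat

# Shortened version of the HTML5-doctype sample (an assumption for brevity).
HTML5_DOC = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<body><p>&nbsp;Sample text</p></body>
</html>"""


def skipped_entities(document, use_foreign_dtd):
    """Parse document with Expat and return the entity names it skipped."""
    names = []
    parser = expat.ParserCreate()
    if use_foreign_dtd:
        # Has to be called before any data is fed to the parser.
        parser.UseForeignDTD(True)
    # Called for entity references that Expat cannot resolve but tolerates.
    parser.SkippedEntityHandler = lambda name, is_parameter: names.append(name)
    parser.Parse(document, True)
    return names


try:
    skipped_entities(HTML5_DOC, use_foreign_dtd=False)
except expat.ExpatError as error:
    print("without UseForeignDTD:", error)

print("with UseForeignDTD:", skipped_entities(HTML5_DOC, use_foreign_dtd=True))
```

ElementTree's fallback relies on the same reporting path, which is why its entity dictionary is only consulted once Expat stops treating the reference as an error.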

The side effect of using UseForeignDTD: as I said, the doctype declaration is completely ignored. This means that you have to declare all entities yourself, even the ones supposedly covered by the doctype.
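Declaring every entity by hand gets tedious, so one option is to fill the dictionary from the standard library's html.entities table. This is a sketch, not part of the original answer: make_parser is a hypothetical helper, and the sample deliberately uses the XHTML 1.0 doctype from the question so that it does not depend on the private _parser attribute. The same dictionary contents apply whichever way the fallback is triggered.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# XML already defines these five, so they are left to the parser.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def make_parser():
    """Return an XMLParser whose entity dict knows the named HTML entities."""
    parser = ET.XMLParser()
    for name, codepoint in name2codepoint.items():
        if name not in XML_PREDEFINED:
            # Values must be the replacement text, not entity references.
            parser.entity[name] = chr(codepoint)
    return parser


xhtml = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>XHTML sample</title></head>
<body><p>&nbsp;caf&eacute; &copy; 2014</p></body>
</html>'''

root = ET.fromstring(xhtml, parser=make_parser())
paragraph = root.find(".//{http://www.w3.org/1999/xhtml}p")
print(repr(paragraph.text))
```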

Note that the values you put into the ElementTree.XMLParser.entity dictionary have to be regular strings, that is, the text the entity will be replaced by. You can no longer refer to other entities there. So for &nbsp; it should be u'\u00A0'.
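A different route, not taken from the answer above, is to declare the entities in the document itself by adding an internal DTD subset to the HTML5 doctype before parsing. Expat expands internally declared entities on its own, so no parser attributes need to be touched. The sketch below assumes the doctype appears literally as <!DOCTYPE html>; declare_entities is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Entity name -> Unicode code point; extend as needed.
ENTITIES = {"nbsp": 0xA0, "copy": 0xA9}


def declare_entities(xhtml, entities):
    """Insert an internal DTD subset declaring the given entities."""
    declarations = "".join(
        '<!ENTITY %s "&#%d;">' % (name, codepoint)
        for name, codepoint in entities.items()
    )
    # Only the first occurrence is replaced; other doctype spellings
    # would need their own handling.
    return xhtml.replace(
        "<!DOCTYPE html>", "<!DOCTYPE html [%s]>" % declarations, 1
    )


xhtml = '''<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head><title>XHTML sample</title></head>
<body><p>&nbsp;Sample text</p></body>
</html>'''

root = ET.fromstring(declare_entities(xhtml, ENTITIES))
paragraph = root.find(".//{http://www.w3.org/1999/xhtml}p")
print(repr(paragraph.text))
```

The trade-off is that the source text is modified before parsing, so line and column numbers in later error messages refer to the rewritten first line rather than the original one.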
