Parse XML with Python resolving an external ENTITY reference

Question

My S1000D XML specifies a DOCTYPE with a reference to a public URL, and that URL in turn references a number of other files that contain all the valid character entities. I've tried to parse it with both xml.etree.ElementTree and lxml, and both report a parse error:

undefined entity &minus;: line 82, column 652

Even though &minus; is a valid entity according to the ENTITY reference specified.

The top of the XML is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE dmodule [
<!ENTITY % ISOEntities PUBLIC 'ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML' 'http://www.s1000d.org/S1000D_4-1/ent/ISOEntities'>
%ISOEntities;]>

If you fetch http://www.s1000d.org/S1000D_4-1/ent/ISOEntities, you'll see that it includes 20 other ent files. One of them, iso-tech.ent, contains the line:

<!ENTITY minus "&#x2212;"> <!-- MINUS SIGN -->

Line 82 of the XML file, near column 652, contains the following: ....Refer to 70&minus;41....

How can I run a Python script to parse this file without getting the undefined entity error?

Sorry, I don't want to specify parser.entity['minus'] = chr(0x2212), for example. I did that as a quick fix, but there are many character entity references. I would like the parser to use the entity reference that is specified in the XML.

I'm surprised, but I've gone around the sun and back and haven't found how to do this (or maybe I have but couldn't follow it). If I update my XML file and add <!ENTITY minus "&#x2212;">, it doesn't fail, so the problem isn't the XML.

It fails on the parse. Here's the code I use for ElementTree:

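import os
import xml.etree.ElementTree as ET

# pth and fn hold the directory and file name of the XML file.
fl = os.path.join(pth, fn)
try:
    root = ET.parse(fl)
except ET.ParseError as p:
    print("ParseError : ", p)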

Here's the code I use for lxml:

import os
from lxml import etree

# pth and fn hold the directory and file name of the XML file.
fl = os.path.join(pth, fn)
try:
    parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
    root = etree.parse(fl, parser=parser)
except etree.XMLSyntaxError as pe:
    print("lxml XMLSyntaxError: ", pe)

I would like the parser to load the ENTITY reference so that it knows that &minus; and all the other character entities specified in those files are valid.

Thank you so much for your advice and help.

Recommended answer

I'm going to answer for lxml. There's no reason to consider ElementTree if you can use lxml.
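If lxml is not available, the quick fix from the question can at least be generalized. The sketch below is a stopgap, not a real DTD load: it fills the entity dictionary of ElementTree's XMLParser from the standard library's html.entities.name2codepoint table. That table covers many of the ISO entity names but not necessarily all of them, so treat the coverage as an assumption to check against the S1000D entity files. The inline document is a hypothetical stand-in for the real file.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Hypothetical stand-in for the S1000D file, using the same DOCTYPE shape
# as in the question. ElementTree does not fetch the external entity set.
doc = """<!DOCTYPE doc [
<!ENTITY % ISOEntities PUBLIC 'ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML' 'http://www.s1000d.org/S1000D_4-1/ent/ISOEntities'>
%ISOEntities;]>
<doc><test>Refer to 70&minus;41</test></doc>"""

parser = ET.XMLParser()
# Map every HTML entity name (minus, nbsp, ...) to its character.
# Assumption: the names used in the document are covered by this table.
parser.entity.update({name: chr(cp) for name, cp in name2codepoint.items()})

root = ET.fromstring(doc, parser=parser)
text = root.find("test").text
print(text)
```

Any entity name missing from the table would still raise the undefined entity error, which is why loading the real entity files with lxml remains the more robust route.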

I think the piece you're missing is no_network=False in the XMLParser; it's True by default, so lxml refuses to fetch the entity files over the network.

Example...

XML Input (test.xml)

<!DOCTYPE doc [
<!ENTITY % ISOEntities PUBLIC 'ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML' 'http://www.s1000d.org/S1000D_4-1/ent/ISOEntities'>
%ISOEntities;]>
<doc>
    <test>Here's a test of minus: &minus;</test>
</doc>

Python

from lxml import etree

parser = etree.XMLParser(load_dtd=True,
                         no_network=False)

tree = etree.parse("test.xml", parser=parser)

etree.dump(tree.getroot())

Output

<doc>
    <test>Here's a test of minus: −</test>
</doc>

If you want the entity reference retained instead, add resolve_entities=False to the XMLParser.
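To illustrate the difference, here is a minimal sketch comparing the two settings. To stay runnable without network access it declares the entity directly in the internal DTD subset, which is an assumption made for the example and not the S1000D setup; the variable names are illustrative.

```python
from lxml import etree

# Assumption for this sketch: the entity is declared inline so that no
# external entity file has to be fetched.
doc = b"""<!DOCTYPE doc [
<!ENTITY minus "&#x2212;">
]>
<doc>
    <test>Here's a test of minus: &minus;</test>
</doc>"""

# Default behaviour: the entity reference is replaced by its value.
resolving = etree.XMLParser(load_dtd=True)
resolved = etree.tostring(etree.fromstring(doc, parser=resolving),
                          encoding="unicode")

# resolve_entities=False keeps the reference as an entity node in the tree.
retaining = etree.XMLParser(load_dtd=True, resolve_entities=False)
retained = etree.tostring(etree.fromstring(doc, parser=retaining),
                          encoding="unicode")

print(resolved)
print(retained)
```

Retaining the references can be useful when the file has to be written back out with its original entity names intact.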

Also, instead of going out to an external location to resolve the parameter entity, consider setting up an XML Catalog. This lets you resolve public and/or system identifiers to local versions.

Example using the same XML input as above...

XML Catalog ("catalog.xml" in the directory "catalog test"; the space in the directory name is there for testing)

<!DOCTYPE catalog PUBLIC "-//OASIS//DTD XML Catalogs V1.1//EN" "http://www.oasis-open.org/committees/entity/release/1.1/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
    <!-- The path in @uri is relative to this file (catalog.xml). -->
    <uri name="http://www.s1000d.org/S1000D_4-1/ent/ISOEntities" uri="./ents/ISOEntities_stackoverflow.ent"/>
</catalog>

Entity File ("ISOEntities_stackoverflow.ent" in the directory "catalog test/ents". Changed the value to "BAM!" for testing)

<!ENTITY minus "BAM!">

Python (Changed no_network to True for additional evidence that the local version of http://www.s1000d.org/S1000D_4-1/ent/ISOEntities is being used.)

import os
from urllib.request import pathname2url
from lxml import etree

# The XML_CATALOG_FILES environment variable is used by libxml2 (which is used by lxml).
# See http://xmlsoft.org/catalog.html.
try:
    xcf_env = os.environ['XML_CATALOG_FILES']
except KeyError:
    # Path to catalog must be a url.
    catalog_path = f"file:{pathname2url(os.path.join(os.getcwd(), 'catalog test/catalog.xml'))}"
    # Temporarily set the environment variable.
    os.environ['XML_CATALOG_FILES'] = catalog_path

parser = etree.XMLParser(load_dtd=True,
                         no_network=True)

tree = etree.parse("test.xml", parser=parser)

etree.dump(tree.getroot())

Output

<doc>
    <test>Here's a test of minus: BAM!</test>
</doc>
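As an alternative to the XML_CATALOG_FILES environment variable, lxml also lets you register a custom resolver on the parser. The following is a rough sketch under stated assumptions: the class name LocalEntityResolver and the inline LOCAL_ENTITIES text are hypothetical stand-ins, and in practice you would more likely return self.resolve_filename(...) pointing at a local copy of the entity files.

```python
from lxml import etree

ISO_URL = "http://www.s1000d.org/S1000D_4-1/ent/ISOEntities"

# Hypothetical stand-in for the contents of a local entity file.
LOCAL_ENTITIES = '<!ENTITY minus "&#x2212;">'


class LocalEntityResolver(etree.Resolver):
    """Serve the ISO entity set locally instead of over the network."""

    def resolve(self, system_url, public_id, context):
        if system_url == ISO_URL:
            return self.resolve_string(LOCAL_ENTITIES, context)
        # Returning None lets lxml fall back to its default resolution.
        return None


# no_network=True: nothing is fetched remotely; the resolver supplies the data.
parser = etree.XMLParser(load_dtd=True, no_network=True)
parser.resolvers.add(LocalEntityResolver())

doc = b"""<!DOCTYPE doc [
<!ENTITY % ISOEntities PUBLIC 'ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML' 'http://www.s1000d.org/S1000D_4-1/ent/ISOEntities'>
%ISOEntities;]>
<doc>
    <test>Here's a test of minus: &minus;</test>
</doc>"""

root = etree.fromstring(doc, parser=parser)
resolved_text = root.find("test").text
print(resolved_text)
```

Compared with a catalog, a resolver keeps the mapping inside the Python script, at the cost of not being shared with other libxml2-based tools.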
