Python ElementTree does not like colon in name of processing instruction

Problem description

The following code:

import xml.etree.ElementTree as ET

xml = '''\
<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig>
    <?LazyComment Blah de blah/?>   
    <testCase runLimit="420" name="d1/n1"/>
    <testCase runLimit="420" name="d1/n2"/>
</testCaseConfig>'''

root = ET.fromstring(xml)

xml2 = xml.replace('LazyComment ', 'LazyComment:')
print(xml2)
try:
    root2 = ET.fromstring(xml2)
except ET.ParseError:
    print("\nERROR in xml2!!!\n")

xml3 = xml2.replace('testCaseConfig', 'testCaseConfig xmlns:Blah="http://www.w3.org/TR/html4/"', 1)
print(xml3)
try:
    root3 = ET.fromstring(xml3)
except ET.ParseError:
    print("\nERROR in xml3!!!\n")
    raise

Gives this output:

<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig>
    <?LazyComment:Blah de blah/?>   
    <testCase runLimit="420" name="d1/n1"/>
    <testCase runLimit="420" name="d1/n2"/>
</testCaseConfig>

ERROR in xml2!!!

<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig xmlns:Blah="http://www.w3.org/TR/html4/">
    <?LazyComment:Blah de blah/?>   
    <testCase runLimit="420" name="d1/n1"/>
    <testCase runLimit="420" name="d1/n2"/>
</testCaseConfig>

ERROR in xml3!!!

Traceback (most recent call last):
  File "C:\Users\Paddy3118\Google Drive\Code\elementtree_error.py", line 30, in <module>
    root3 = ET.fromstring(xml3)
  File "C:\Anaconda3\envs\Py3.5\lib\xml\etree\ElementTree.py", line 1333, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 3, column 17

I searched and found this Q that pointed to other resources that I read.

It seems that the '?' makes it a processing instruction, whose tag name can include colons. Without the '?', a colon in a name indicates a namespace, and one of the answers stated that defining the namespace should make things work.

Combining '?' and ':' though causes issues with ElementTree.

I am given xml files of this type that are used by other tools that do parse it OK and want to process the files myself using Python. Any ideas?

Thanks.

Solution

According to the W3C Extensible Markup Language 1.0 Specifications under Common Syntactic Constructs:

The Namespaces in XML Recommendation [XML Names] assigns a meaning to names containing colon characters. Therefore, authors should not use the colon in XML names except for namespace purposes, but XML processors must accept the colon as a name character.

And further in the W3C XPath 1.0 note on Processing Instruction nodes:

A processing instruction has an expanded-name: the local part is the processing instruction's target; the namespace URI is null.

Altogether, <?LazyComment:Blah de blah/?> is an invalid processing instruction for a namespace-aware parser: colons are used to reference namespace URIs, and for processing instructions that part is null or empty. ElementTree parses with namespace processing enabled, which is why Python's XML processor complains that a document containing such an instruction is not well-formed.
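
To check that namespace processing is what rejects the colon, a small experiment with the standard-library xml.parsers.expat module (the parser ElementTree is built on) may help. This is only a sketch: the helper name parses and the sample string doc are made up for illustration and are not part of the question's code.

```python
from xml.parsers import expat

# Sample document with a colon in the processing-instruction target
# (hypothetical data modeled on the question).
doc = '''<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig>
    <?LazyComment:Blah de blah/?>
    <testCase runLimit="420" name="d1/n1"/>
</testCaseConfig>'''


def parses(text, namespace_separator=None):
    """Return True if expat accepts text, False if it raises ExpatError."""
    # namespace_separator=None means plain XML 1.0 parsing;
    # any one-character string turns namespace processing on.
    parser = expat.ParserCreate(namespace_separator=namespace_separator)
    try:
        parser.Parse(text, True)
    except expat.ExpatError:
        return False
    return True


print("plain XML 1.0 parsing:", parses(doc))
print("namespace-aware parsing:", parses(doc, namespace_separator="}"))
```

The first call should succeed and the second should fail, which matches the XML 1.0 quote above (the colon is a legal name character) and the namespace rule that keeps colons out of processing-instruction targets.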

Also, reconsider the tools that generate such processing instructions, as they are not producing namespace-conformant XML documents. Possibly, such tools treat XML files as plain text (similar to the way you were able to replace part of the XML string, but would not have been able to append such an instruction using etree).
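
If the files cannot be fixed at the source, one possible workaround is to rewrite the offending targets as text before handing the document to ElementTree. The sketch below is not a general solution: the function name sanitize_pi_targets and the underscore replacement are my own choices, the pattern only covers ASCII names, and a plain regex will also rewrite a matching "<?name:" sequence inside a comment or CDATA section.

```python
import re
import xml.etree.ElementTree as ET

# Start of a processing instruction whose target contains at least one colon.
# ASCII-only on purpose; real XML names allow more characters.
_PI_WITH_COLON = re.compile(
    r'<\?([A-Za-z_][A-Za-z0-9_.\-]*(?::[A-Za-z0-9_.\-]+)+)')


def sanitize_pi_targets(text, replacement='_'):
    """Replace colons in processing-instruction targets (hypothetical helper)."""
    return _PI_WITH_COLON.sub(
        lambda m: '<?' + m.group(1).replace(':', replacement), text)


xml2 = '''<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig>
    <?LazyComment:Blah de blah/?>
    <testCase runLimit="420" name="d1/n1"/>
    <testCase runLimit="420" name="d1/n2"/>
</testCaseConfig>'''

# The XML declaration has no colon in its target, so it is left untouched.
root = ET.fromstring(sanitize_pi_targets(xml2))
print([tc.get('name') for tc in root.findall('testCase')])
```

Note that ET.fromstring does not keep processing instructions in the resulting tree by default, so the renamed target only matters if the sanitized text itself is written back out.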
