Why is ElementTree raising a ParseError?

Question

I have been trying to parse a file with xml.etree.ElementTree:

import xml.etree.ElementTree as ET
from xml.etree.ElementTree import ParseError

def analyze(xml):
    it = ET.iterparse(xml)  # iterparse accepts a filename; file() exists only in Python 2
    count = 0
    last = None

    try:
        for (ev, el) in it:
            count += 1
            last = el

    except ParseError:
        print("catastrophic failure")
        print("last successful: {0}".format(last))

    print('count: {0}'.format(count))

This is of course a simplified version of my code, but it is enough to break my program. If I remove the try/except block, I get this error with some files:

Traceback (most recent call last):
  File "<pyshell#22>", line 1, in <module>
    from yparse import analyze; analyze('file.xml')
  File "C:\Python27\yparse.py", line 10, in analyze
    for (ev, el) in it:
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1258, in next
    self._parser.feed(data)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1624, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
    raise err
ParseError: reference to invalid character number: line 1, column 52459

The results are deterministic, though: if a file works, it always works. If a file fails, it always fails, and always at the same point.

The strangest part is this: I use the traceback to check whether malformed XML is breaking the parser, then isolate the node that caused the failure. But when I create an XML file containing just that node and a few of its neighbors, it parses fine!

This doesn't seem to be a size problem either. I have managed to parse much larger files with no problems.

Any ideas?

Recommended answer

As @John Machin suggested, the files in question do have dubious numeric entities in them, though the error messages seem to be pointing at the wrong place in the text. Perhaps the streaming nature and buffering are making it difficult to report accurate positions.
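Since the reported column may be unreliable, one way to look for such references without relying on the parser is to scan the raw text directly. The following is only a rough sketch: `find_invalid_char_refs` and `is_xml10_char` are hypothetical helper names, not part of ElementTree, and the range check follows my reading of the `Char` production in the XML 1.0 specification.

```python
import re

# Matches decimal (&#10;) and hexadecimal (&#x0A;) character references.
CHAR_REF = re.compile(r'&#(?:x([0-9a-fA-F]+)|([0-9]+));')

def is_xml10_char(cp):
    """Return True if code point cp is a legal XML 1.0 character."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def find_invalid_char_refs(text):
    """Return (offset, reference) pairs for references XML 1.0 forbids."""
    found = []
    for m in CHAR_REF.finditer(text):
        hex_digits, dec_digits = m.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if not is_xml10_char(cp):
            found.append((m.start(), m.group(0)))
    return found
```

Calling `find_invalid_char_refs(open('file.xml').read())` would give character offsets into the file, which can then be compared against the column in the error message. Note that this plain regex scan also reports matches inside comments and CDATA sections, where the text is literal and harmless.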

In fact, all of these entities appear in the text:

set(['&#x08;', '&#x0E;', '&#x1E;', '&#x1C;', '&#x18;', '&#x04;', '&#x0A;', '&#x0C;', '&#x16;', '&#x14;', '&#x06;', '&#x00;', '&#x10;', '&#x02;', '&#x0D;', '&#x1D;', '&#x0F;', '&#x09;', '&#x1B;', '&#x05;', '&#x15;', '&#x01;', '&#x03;'])

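Most are not allowed: of the characters below U+0020, XML 1.0 permits only tab (&#x09;), line feed (&#x0A;), and carriage return (&#x0D;). Looks like this parser is quite strict; you'll need to find another that is not so strict, or pre-process the XML.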
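For the pre-processing route, a minimal sketch might look like the following. It assumes the file fits in memory and uses an ASCII-compatible encoding such as UTF-8, so the references can be matched on the raw bytes. `strip_invalid_char_refs` and `parse_lenient` are hypothetical helpers written for illustration, not ElementTree functions.

```python
import re
import xml.etree.ElementTree as ET

# Numeric character references, matched on raw bytes (they are pure ASCII).
CHAR_REF = re.compile(br'&#(?:x([0-9a-fA-F]+)|([0-9]+));')

def _is_xml10_char(cp):
    """Return True if code point cp is a legal XML 1.0 character."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def strip_invalid_char_refs(data, replacement=b''):
    """Replace references to characters XML 1.0 forbids; keep valid ones."""
    def fix(m):
        hex_digits, dec_digits = m.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return m.group(0) if _is_xml10_char(cp) else replacement
    return CHAR_REF.sub(fix, data)

def parse_lenient(path):
    """Read the whole file, drop forbidden references, then parse it."""
    with open(path, 'rb') as f:
        data = f.read()
    return ET.fromstring(strip_invalid_char_refs(data))
```

This trades the streaming behaviour of `iterparse` for a single in-memory parse, and it silently discards data, so it only makes sense if the control characters carry no meaning. It would also rewrite matching text inside comments or CDATA sections, where such sequences are literal text.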
