Python: Unicode and ElementTree.parse

Question

I'm trying to move to Python 2.7, and since Unicode is a Big Deal there, I wanted to try handling XML files and text as unicode and parsing them with the xml.etree.cElementTree library. But I ran into this error:

>>> import xml.etree.cElementTree as ET
>>> from io import StringIO
>>> source = """\
... <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
... <root>
...   <Parent>
...     <Child>
...       <Element>Text</Element>
...     </Child>
...   </Parent>
... </root>
... """
>>> srcbuf = StringIO(source.decode('utf-8'))
>>> doc = ET.parse(srcbuf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 56, in parse
  File "<string>", line 35, in parse
cElementTree.ParseError: no element found: line 1, column 0

The same thing happens when I open the file with io.open('filename.xml', encoding='utf-8') and pass it to ET.parse:

>>> import io
>>> with io.open('test.xml', mode='w', encoding='utf-8') as fp:
...     fp.write(source.decode('utf-8'))
...
150L
>>> with io.open('test.xml', mode='r', encoding='utf-8') as fp:
...     fp.read()
...
u'<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>\n<root>\n  <Parent>\n    <Child>\n      <Element>Text</Element>\n    </Child>\n  </Parent>\n</root>\n'
>>> with io.open('test.xml', mode='r', encoding='utf-8') as fp:
...     ET.parse(fp)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "<string>", line 56, in parse
  File "<string>", line 35, in parse
cElementTree.ParseError: no element found: line 1, column 0

Is there something about unicode and ET parsing that I am missing here?

Edit: Apparently the ET parser does not play well with a unicode input stream? The following works:

>>> with io.open('test.xml', mode='rb') as fp:
...     ET.parse(fp)
...
<ElementTree object at 0x0180BC10>

But this also means I cannot use io.StringIO if I want to parse in-memory text, unless I first encode it into an in-memory byte buffer?
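
For reference, a minimal sketch of what "encode it first into an in-memory buffer" could look like. It assumes the text is encoded as UTF-8 to match the XML declaration, and it imports xml.etree.ElementTree rather than cElementTree so the snippet also runs on newer Python versions; the names srcbuf and element_text are only illustrative.

```python
import io
import xml.etree.ElementTree as ET  # assumption: stands in for cElementTree

# Text (unicode) version of the document from the first example.
source = u"""\
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<root>
  <Parent>
    <Child>
      <Element>Text</Element>
    </Child>
  </Parent>
</root>
"""

# Encode the text with the encoding named in the XML declaration, then
# wrap the resulting bytes in a binary in-memory buffer.
srcbuf = io.BytesIO(source.encode('utf-8'))

# ET.parse reads bytes from the buffer, as it does from a file opened
# with mode='rb'.
doc = ET.parse(srcbuf)
element_text = doc.getroot().find('Parent/Child/Element').text
```

This mirrors the mode='rb' case above: the parser receives bytes and decodes them itself using the declared encoding, so io.BytesIO takes the place of io.StringIO.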

Solution

Can't you use

doc = ET.fromstring(source)

in your first example?
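
A short sketch of that suggestion, under a few assumptions: source is a byte string (as in the first example, where it is decoded before being wrapped in StringIO), and xml.etree.ElementTree is imported in place of cElementTree so the snippet is not tied to Python 2.7. Note that ET.fromstring returns the root Element rather than an ElementTree, so the wrapping step is only needed if the rest of the code expects the object that ET.parse would return.

```python
import xml.etree.ElementTree as ET  # assumption: stands in for cElementTree

# Byte-string version of the document, as in the first example.
source = b"""\
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<root>
  <Parent>
    <Child>
      <Element>Text</Element>
    </Child>
  </Parent>
</root>
"""

# fromstring parses the bytes directly; no file-like object is involved.
root = ET.fromstring(source)

# Optional: wrap the root Element to get an ElementTree, which is the
# type ET.parse returns.
doc = ET.ElementTree(root)
```

If the data is already held as unicode text, encoding it first (for example with .encode('utf-8')) before calling ET.fromstring keeps the input consistent with the encoding declaration.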
