iterparse and unicode


Problem description

It seems xml.etree.cElementTree.iterparse() is not unicode aware:

>>> from StringIO import StringIO
>>> from xml.etree.cElementTree import iterparse
>>> s = u'<name>\u03a0\u03b1\u03bd\u03b1\u03b3\u03b9\u03ce\u03c4\u03b7\u03c2</name>'
>>> for event, elem in iterparse(StringIO(s)):
...     print elem.text
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 64, in __iter__
UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-15: ordinal not in range(128)

Am I using it incorrectly or it doesn't currently support unicode?

George

Solution

On Aug 21, 8:36 am, George Sakkis <george.sak...@gmail.com> wrote:

> It seems xml.etree.cElementTree.iterparse() is not unicode aware:
>
> >>> from StringIO import StringIO
> >>> from xml.etree.cElementTree import iterparse
> >>> s = u'<name>\u03a0\u03b1\u03bd\u03b1\u03b3\u03b9\u03ce\u03c4\u03b7\u03c2</name>'
> >>> for event, elem in iterparse(StringIO(s)):
> ...     print elem.text
> ...
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "<string>", line 64, in __iter__
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-15: ordinal not in range(128)
>
> Am I using it incorrectly or it doesn't currently support unicode?

Hi George,
I'm no XML guru by any means, but as far as I understand it, you would
need to encode your text into UTF-8 and prepend something like
'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' to it. This appears
to be the way XML is, rather than an ElementTree problem.

E.g.

>>> from StringIO import StringIO
>>> from xml.etree.cElementTree import iterparse
>>> s = u'<wrapper><name>\u03a0\u03b1</name><digits>01234567</digits></wrapper>'
>>> h = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
>>> xml = h + s.encode('utf8')
>>> for event, elem in iterparse(StringIO(xml)):
...     print elem.tag, repr(elem.text)
...
name u'\u03a0\u03b1'
digits '01234567'
wrapper None
>>>

HTH,
John


On Wed, 2008-08-20 at 15:36 -0700, George Sakkis wrote:

> It seems xml.etree.cElementTree.iterparse() is not unicode aware:
>
> >>> from StringIO import StringIO
> >>> from xml.etree.cElementTree import iterparse
> >>> s = u'<name>\u03a0\u03b1\u03bd\u03b1\u03b3\u03b9\u03ce\u03c4\u03b7\u03c2</name>'
> >>> for event, elem in iterparse(StringIO(s)):
> ...     print elem.text
> ...
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "<string>", line 64, in __iter__
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-15: ordinal not in range(128)
>
> Am I using it incorrectly or it doesn't currently support unicode?
>
> George

As iterparse expects an actual file as input, using a unicode string is
problematic. If you want to use iterparse, the simplest way would be to
encode your string before inserting it into the StringIO object, like so:

>>> for event, elem in iterparse(StringIO(s.encode('UTF8'))):
...     print elem.text
...

If you encode using UTF-8, you don't need to worry about the <?xml header
bit as suggested previously, as it's the default for XML.
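
For readers on Python 3, where the StringIO module and cElementTree no longer exist, the same idea might be sketched as below. This is an illustration rather than code from the thread: io.BytesIO and xml.etree.ElementTree are assumed as stand-ins, and parse_texts is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET


def parse_texts(xml_text):
    """Return (tag, text) pairs for each element in a unicode XML string.

    Hypothetical helper: the string is encoded to UTF-8 bytes first, so
    iterparse reads from a byte stream, as suggested above.
    """
    source = io.BytesIO(xml_text.encode("utf-8"))
    # By default iterparse yields ("end", element) pairs.
    return [(elem.tag, elem.text) for _event, elem in ET.iterparse(source)]


if __name__ == "__main__":
    s = "<name>\u03a0\u03b1\u03bd\u03b1\u03b3\u03b9\u03ce\u03c4\u03b7\u03c2</name>"
    for tag, text in parse_texts(s):
        # ascii() keeps the output printable on any terminal encoding.
        print(tag, ascii(text))
```

No XML declaration is prepended here, relying on UTF-8 being the default when none is given.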

If you're using unicode extensively, you should consider using lxml,
which implements the same interface as ElementTree, but handles unicode
better (though it also doesn't run your example above without first
encoding the string):
http://codespeak.net/lxml/parsing.ht...nicode-strings

You may also find the target parser interface to be more accepting of
unicode than iterparse, though it requires a different parsing interface:
http://codespeak.net/lxml/parsing.ht...rser-interface
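
To give a rough idea of what a target parser looks like, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree.XMLParser on Python 3 rather than lxml, on the assumption that the start/data/end/close protocol is similar enough to illustrate the point; TextCollector and collect_texts are hypothetical names.

```python
import xml.etree.ElementTree as ET


class TextCollector:
    """Parser target that records (tag, text) for each element that closes.

    Hypothetical class for illustration; start, data, end and close are
    the methods the parser calls on its target.
    """

    def __init__(self):
        self._chunks = []  # text pieces seen since the last tag boundary
        self.results = []

    def start(self, tag, attrib):
        self._chunks = []

    def data(self, text):
        # Text may arrive in several pieces, so collect and join later.
        self._chunks.append(text)

    def end(self, tag):
        self.results.append((tag, "".join(self._chunks)))
        self._chunks = []

    def close(self):
        return self.results


def collect_texts(xml_text):
    """Feed a unicode XML string (encoded as UTF-8) to a target parser."""
    parser = ET.XMLParser(target=TextCollector())
    parser.feed(xml_text.encode("utf-8"))
    # close() returns whatever the target's close() returns.
    return parser.close()
```

Unlike iterparse, no element tree is built; the target only sees events, so text directly inside a parent element is reported as an empty string in this sketch.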

--
John Krukoff <jk******@ltgc.com>
Land Title Guarantee Company


Thank you both for the suggestions. I made a few more experiments to
understand how iterparse behaves with respect to three dimensions:

a. Is the encoding declared in the header (if there is one)?
b. Is the text ascii-encodable (i.e. within range(128))?
c. Does the passed file object's read() method return str or unicode
(e.g. codecs.open(f, encoding='utf8'))?

Feel free to correct me if I misinterpreted what is really happening.

As John Krukoff mentioned, omitting the encoding is equivalent to
encoding="utf-8" for all other combinations. This leaves (b) and (c).

If a text node is ascii-encodable, iterparse() returns it as a byte
string, regardless of the declared encoding and the input file's
read() return type.

(c) becomes relevant only if a text node is not ascii-encodable. In
this case iterparse() returns unicode if the underlying file's read()
returns bytes in an encoding that matches (or at least is compatible
with) the declared encoding in the header (or the implied utf8).
Passing a file object whose read() returns unicode characters
implicitly encodes them to ascii, which raises a UnicodeEncodeError
since the text node is not ascii-encodable.
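
The requirement that the bytes match the declared (or implied UTF-8) encoding can be illustrated with a small Python 3 sketch. The ISO-8859-1 sample text and the helper names first_text and parses are assumptions chosen for illustration; they do not come from the experiments described here.

```python
import io
import xml.etree.ElementTree as ET


def first_text(xml_bytes):
    """Return the text of the first element that iterparse finishes."""
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes)):
        return elem.text


def parses(xml_bytes):
    """Report whether the byte string can be parsed at all."""
    try:
        first_text(xml_bytes)
    except ET.ParseError:
        return False
    return True


body = "<name>caf\u00e9</name>"

# Bytes and declaration agree: ISO-8859-1 bytes under an ISO-8859-1 header.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body.encode("iso-8859-1")

# No declaration, so UTF-8 is assumed, but the bytes are ISO-8859-1.
undeclared = body.encode("iso-8859-1")

if __name__ == "__main__":
    print(ascii(first_text(declared)))
    print(parses(undeclared))
```

The second input fails because the lone 0xE9 byte is not valid UTF-8, which is the mismatch case described above.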

It's interesting that the element text attributes after a successful
parse do not necessarily have the same type, i.e. all be str or all
unicode. I ported some text extraction code from BeautifulSoup (which
handles all text as unicode) and I was surprised to find out that in
xml.etree the returned text's type is not fixed, even within the same
file. Although it's not a bug, having a mixed collection of byte and
unicode strings from the same source makes me somewhat uneasy.

George
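
One way to avoid carrying a mixed collection around is to coerce every text value to a single type as it leaves the parser. The sketch below is written for Python 3, where xml.etree returns str for all element text, so the bytes branch only matters for Python 2 style results like those described above; to_text and all_texts are hypothetical helper names.

```python
import io
import xml.etree.ElementTree as ET


def to_text(value, encoding="ascii"):
    """Coerce an element's text to a unicode string (hypothetical helper).

    None becomes "", byte strings are decoded, unicode passes through.
    """
    if value is None:
        return ""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def all_texts(xml_bytes):
    """Return the normalized text of every element in the document."""
    source = io.BytesIO(xml_bytes)
    return [to_text(elem.text) for _event, elem in ET.iterparse(source)]
```

Decoding with ascii by default is deliberate in this sketch: a byte string is only expected here when the text was ascii-encodable in the first place.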

