Is it possible to consume UTF8 XML documents using xml.dom.pulldom?


Question

I'm having a horrible time trying to get xml.dom.pulldom to consume a
UTF8 encoded XML file. Here's what I've tried so far:


>>> xml_utf8 = """<?xml version="1.0" encoding="UTF-8" ?>
<msg>Simon\xe2\x80\x99s XML nightmare</msg>
"""
>>> from xml.dom import pulldom
>>> parser = pulldom.parseString(xml_utf8)
>>> parser.next()



('START_DOCUMENT', <xml.dom.minidom.Document instance at 0x6f06c0>)
>>> parser.next()
('START_ELEMENT', <DOM Element: msg at 0x6f0710>)
>>> parser.next()
....
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in
position 21: ordinal not in range(128)

xml.dom.minidom can handle the string just fine:


>>> from xml.dom import minidom
>>> dom = minidom.parseString(xml_utf8)
>>> dom.toxml()
u'<?xml version="1.0" ?><msg>Simon\u2019s XML nightmare</msg>'

If I pass a unicode string to pulldom instead of a utf8 encoded
bytestring it still breaks:


>>> xml_unicode = u'<?xml version="1.0" ?><msg>Simon\u2019s XML nightmare</msg>'
>>> parser = pulldom.parseString(xml_unicode)
....
/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/dom/pulldom.py in parseString(string, parser)
    346
    347 bufsize = len(string)
--> 348 buf = StringIO(string)
    349 if not parser:
    350 parser = xml.sax.make_parser()
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in
position 32: ordinal not in range(128)

Is it possible to consume utf8 or unicode using xml.dom.pulldom or
should I try something else?

Thanks,

Simon Willison

Answers

Follow up question: what's the best way of incrementally consuming XML
in Python that's character encoding aware? I have a very large file to
consume but I'd rather not have to fall back to the raw SAX API.
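
One possible shape for this, sketched under assumptions the thread does not make: it targets a current Python 3 rather than the 2.5 build discussed here, and the `<msgs>` wrapper element and the helper name `iter_msg_texts` are made up for illustration. The idea is to hand pulldom a binary stream so the parser decodes the bytes itself from the XML declaration, and to expand only one element at a time.

```python
import io
from xml.dom import pulldom

# UTF-8 encoded bytes, as they would arrive from a file opened in binary mode.
# The <msgs> wrapper is hypothetical; the thread's sample has a single <msg>.
xml_utf8 = (
    b'<?xml version="1.0" encoding="UTF-8" ?>\n'
    b'<msgs><msg>Simon\xe2\x80\x99s XML nightmare</msg><msg>second</msg></msgs>'
)


def iter_msg_texts(stream):
    """Yield the text of each <msg> element, one element at a time."""
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "msg":
            # Pull in only this element's children, not the whole document.
            events.expandNode(node)
            # Merge adjacent text nodes in case the parser split the text.
            node.normalize()
            yield node.firstChild.data if node.firstChild else ""


texts = list(iter_msg_texts(io.BytesIO(xml_utf8)))
print(texts)
```

For a large file, the `io.BytesIO` object would be replaced by a file opened with mode `"rb"`; only the elements passed to `expandNode` are materialised as DOM nodes.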


On 30 Jul, 16:32, Simon Willison <si...@simonwillison.net> wrote:

> I'm having a horrible time trying to get xml.dom.pulldom to consume a
> UTF8 encoded XML file. Here's what I've tried so far:
>
> >>> xml_utf8 = """<?xml version="1.0" encoding="UTF-8" ?>
> <msg>Simon\xe2\x80\x99s XML nightmare</msg>
> """
> >>> from xml.dom import pulldom
> >>> parser = pulldom.parseString(xml_utf8)
> >>> parser.next()
> ('START_DOCUMENT', <xml.dom.minidom.Document instance at 0x6f06c0>)
> >>> parser.next()
> ('START_ELEMENT', <DOM Element: msg at 0x6f0710>)
> >>> parser.next()
> ...
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in
> position 21: ordinal not in range(128)



I can't reproduce this on Python 2.3.6 or 2.4.4 on RHEL 4. Instead, I
get the usual...

('CHARACTERS', <DOM Text node "Simon\u2019s XM...">)

And I can get the content of the text node as a proper Unicode object.

[...]


> Is it possible to consume utf8 or unicode using xml.dom.pulldom or
> should I try something else?



Yes, it is possible, at least in Python 2.3.6 and 2.4.4 configured
with --enable-unicode=ucs4 (which is what Red Hat does and expects).

Paul

P.S. You shouldn't try to pass Unicode to the parser, since XML
parsing in its entirety deals with byte sequences and character
encodings, although I suppose that there's some kind of character-based
(i.e. Unicode value-based) parsing method defined somewhere by
some committee or other.
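
The advice in the P.S. can be illustrated with a minimal sketch. It assumes a current Python 3 (not the 2.x builds discussed in this thread), and the variable names are illustrative only: text that is already decoded is encoded back to UTF-8 bytes before being given to pulldom as a binary stream, so the parser does the decoding itself.

```python
import io
from xml.dom import pulldom

# Already-decoded text, like the xml_unicode value in the question.
text = u'<?xml version="1.0" ?><msg>Simon\u2019s XML nightmare</msg>'

# Encode explicitly so the parser receives a byte sequence. With no
# encoding declaration, an XML parser treats the bytes as UTF-8.
data = text.encode("utf-8")

events = pulldom.parse(io.BytesIO(data))
chunks = []
for event, node in events:
    if event == pulldom.CHARACTERS:
        # node is a DOM Text node; its data is a decoded string.
        chunks.append(node.data)

message = "".join(chunks)
print(repr(message))
```

The character data may arrive in more than one CHARACTERS event, which is why the sketch joins the pieces rather than taking the first one.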


On Jul 30, 4:43 pm, Paul Boddie <p...@boddie.org.uk> wrote:

> I can't reproduce this on Python 2.3.6 or 2.4.4 on RHEL 4. Instead, I
> get the usual...
>
> ('CHARACTERS', <DOM Text node "Simon\u2019s XM...">)



I'm using Python 2.5.1 on OS X Leopard:

