Python minidom and UTF-8 encoded XML with hash references


Question

I am experiencing some difficulty in my home project, where I need to parse a SOAP request. The SOAP is generated with gSOAP and involves string parameters with special characters like the Danish letters "æøå".

gSOAP builds SOAP requests with UTF-8 encoding by default, but instead of sending the special characters in raw format (i.e. bytes C3 A6 for the special character "æ") it sends what I think are called character hash references (i.e. &#195;&#166;).

I don't completely understand why gSOAP does it this way, as I can see that it has marked the incoming payload as being UTF-8 encoded anyway (Content-Type: text/xml; charset=utf-8), but this is beside the point (I think).

Anyway I guess gSOAP probably is obeying transport rules, or what?

When I parse the request from gSOAP in Python with xml.dom.minidom.parseString() I get element values as unicode objects, which is fine, but the character hash references are not decoded as UTF-8 character codes. It unescapes the character hash references, but does not decode the string afterwards. In the end I have a unicode string object whose code points are the UTF-8 bytes:

So if the string "æble" is contained in the XML, it comes like this in the request:

"&#195;&#166;ble"

After parsing the XML the unicode string in the DOM Text Node's data member looks like this:

u'\xc3\xa6ble'

I would expect it to look like this:

u'\xe6ble'
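The behaviour described above can be reproduced with a short sketch. It is written for Python 3, where `str` plays the role of the `unicode` type in the question; the element name `v` and the helper `text_of` are made up for illustration, and `ascii()` is used so the output looks like the escaped reprs quoted above.

```python
from xml.dom import minidom

def text_of(xml_bytes):
    """Parse an XML document and return the text of its root element."""
    doc = minidom.parseString(xml_bytes)
    return doc.documentElement.firstChild.data

# What arrives in the request: one reference per UTF-8 byte of "æ".
received = text_of(b'<v>&#195;&#166;ble</v>')

# What one reference per code point would look like.
expected = text_of(b'<v>&#230;ble</v>')

print(ascii(received))  # '\xc3\xa6ble'
print(ascii(expected))  # '\xe6ble'
```

In both cases minidom does what the XML says: each reference is turned into the character with that code point. The difference is entirely in what the sender wrote.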

What am I doing wrong? Should I unescape the SOAP XML before parsing it, or is it somewhere else I should be looking for the solution, maybe gSOAP?

Thanks in advance.

Best regards Jakob Simon-Gaarde

Solution

Here's how to unescape such stuff: http://effbot.org/zone/re-sub.htm#unescape-html

However the primary problem is what you and/or this "gSOAP" (URL, please) are doing ...

Your example character is LATIN SMALL LETTER AE (U+00E6). As you say, encoded in UTF-8, this is \xc3\xa6. 0xc3 == 195 and 0xa6 == 166. 0xe6 == 230. Escaping your character should produce '&#230;', not '&#195;&#166;'.
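The arithmetic can be checked directly. This is a small Python 3 sketch using only the standard library; the variable names are illustrative.

```python
import unicodedata

ch = '\u00e6'  # æ

name = unicodedata.name(ch)
code_point = ord(ch)
utf8_bytes = list(ch.encode('utf-8'))

# A numeric character reference carries the code point, not the encoded bytes.
correct_reference = '&#%d;' % code_point

print(name)               # LATIN SMALL LETTER AE
print(code_point)         # 230
print(utf8_bytes)         # [195, 166]
print(correct_reference)  # &#230;
```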

However it appears that it is encoding to UTF-8 first and then doing the escaping.
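To make the suspected order of operations concrete, here is a hedged Python 3 sketch. The two escaping functions are hypothetical stand-ins, not gSOAP code: one escapes code points (what you would expect), the other escapes UTF-8 bytes (what the payload suggests is happening). The `repair` helper is a possible receiver-side workaround, assuming every affected string really was produced this way; it is not a substitute for fixing the sender.

```python
def escape_code_points(text):
    """Escape each non-ASCII character as one reference to its code point."""
    return ''.join(c if ord(c) < 128 else '&#%d;' % ord(c) for c in text)

def escape_utf8_bytes(text):
    """Encode to UTF-8 first, then escape each non-ASCII byte (the suspected bug)."""
    return ''.join(chr(b) if b < 128 else '&#%d;' % b
                   for b in text.encode('utf-8'))

def repair(parsed_text):
    """Undo the byte-wise escaping after parsing.

    Each character is mapped back to the byte with the same value (Latin-1),
    and the resulting bytes are decoded as UTF-8. Raises UnicodeError if the
    text was not actually mangled in this way.
    """
    return parsed_text.encode('latin-1').decode('utf-8')

print(escape_code_points('\xe6ble'))    # &#230;ble
print(escape_utf8_bytes('\xe6ble'))     # &#195;&#166;ble
print(ascii(repair('\xc3\xa6ble')))     # '\xe6ble'
```

Applying `repair` to text that was escaped correctly would fail or corrupt it, which is why the diagnostics asked for here matter before choosing a fix.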

What you need to do is to show us in fine detail the code that you are using together with diagnostic prints (using the repr() function so that we can see the type and unambiguously-represented contents) of each str and unicode object involved in the process. Also provide the docs for the gSOAP API(s) that you are using.

On the receiving end, please show us the repr() of the raw XML that you receive.

Edit in response to this comment on another answer: """The problem is that minidom.parseString() does not seem to unescape the character hash representation before it decodes to unicode."""

It (and any other XML parser) {does not, cannot in generality, and must not} unescape numeric character references or predefined character entities BEFORE decoding.

(1) Unescaping "&#60;" to "<" before parsing would put a literal "<" into the character data, and the parse would blow up.

(2) What would you unescape "&#256;" to? "\xc4\x80"?

(3) How could it unescape at all if the encoding was UTF-16 (LE or BE)?
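The three objections can be illustrated with a deliberately naive pre-pass. `naive_unescape` below is a hypothetical function written only for this demonstration (Python 3, standard library); it replaces each `&#NNN;` with the single byte NNN before the parser sees the document, which is the behaviour the comment seems to expect.

```python
import re
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def naive_unescape(raw):
    """Hypothetical pre-pass: turn each &#NNN; into the byte NNN before parsing."""
    return re.sub(rb'&#(\d+);', lambda m: bytes([int(m.group(1))]), raw)

def text_of(raw):
    """Parse an XML document and return the text of its root element."""
    return minidom.parseString(raw).documentElement.firstChild.data

# (1) The escaped "<" becomes live markup, so the document is no longer well-formed.
try:
    text_of(naive_unescape(b'<v>a &#60; b</v>'))
    case1 = 'parsed'
except ExpatError:
    case1 = 'not well-formed'

# (2) 256 does not fit in one byte, so there is no byte to substitute.
try:
    naive_unescape(b'<v>&#256;</v>')
    case2 = 'substituted'
except ValueError:
    case2 = 'no single byte for 256'

# (3) In UTF-16 every character of "&#230;" is two bytes wide, so a
#     byte-level search for the reference finds nothing to replace.
utf16_doc = '<v>&#230;ble</v>'.encode('utf-16')
case3_unchanged = naive_unescape(utf16_doc) == utf16_doc

# Left alone, the parser decodes first and resolves references afterwards.
normal_lt = text_of(b'<v>a &#60; b</v>')
normal_utf16 = text_of(utf16_doc)

print(case1, '|', case2, '|', case3_unchanged)
print(normal_lt, '|', ascii(normal_utf16))
```

A reference names a character, so it can only be resolved once the parser is already working with characters rather than bytes.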
