XML parsing escape characters


Problem description



Hi,

I only know a little bit of XML and I'm trying to parse an XML document
in order to save its elements in a file (dictionaries inside a list).

When I access a URL from Python 2.3.3 running on Linux with the
following lines:

import urllib
from xml.dom import minidom

resposta = urllib.urlopen(url)
xmldoc = minidom.parse(resposta)
resposta.close()

I get the following result:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www......">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;
(... others ...)
~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>
__________________________________________________ ___________

In the lines below, I try to get all the child nodes of <string>, first
by counting them, and then ignoring the "\n" ones:

stringNode = xmldoc.childNodes[0]
print stringNode.toxml()
dataSetNode = stringNode.childNodes[0]
numNos = len(dataSetNode.childNodes)
# Serialize every child node, keyed by its position
todosNos = {}
for no in range(numNos):
    todosNos[no] = dataSetNode.childNodes[no].toxml()
# Keep only positions whose serialized form is longer than 4 characters
posicaoXml = [no for no in todosNos.keys() if len(todosNos[no]) > 4]
print posicaoXml

(I'm almost sure there's a simpler way to do this...)
__________________________________________________ ___________

I don't get any elements. But if I access the same URL via a browser,
the result in the browser window is something like:

<string xmlns="http://www......">
~ <DataSet>
~ <Order>
~ <Customer>439</Customer>
(... others ...)
~ </Order>
~ </DataSet>
</string>

and the lines I posted work as intended.

I have already searched the web and I know it's about the escape characters, but
I didn't find a simple solution for this.

I tried to use LL2XML.py and an unescape function with a simple replace

text = text.replace("&lt;", "<")

but I had to convert the XML document to a string, and then I could not (or
don't know how to) convert it back to an XML object.

How can I solve this? Please explain it keeping in mind that I'm just
beginning with XML and I'm not very experienced in Python either.
Luis

Recommended answer

Luis P. Mendes wrote:
> I get the following result:
>
> <?xml version="1.0" encoding="utf-8"?>
> <string xmlns="http://www......">&lt;DataSet&gt;
> ~ &lt;Order&gt;

Most likely, this result is correct, and your document
really does contain

&lt;Order&gt;

> I don't get any elements. But, if I access the same url via a browser,
> the result in the browser window is something like:
>
> <string xmlns="http://www......">
> ~ <DataSet>

Most likely, your browser is incorrect (or at least confusing), and
renders &lt; as "<", even though this is not markup.

> I already browsed the web, I know it's about the escape characters, but
> I didn't find a simple solution for this.







Not sure what "this" is. AFAICT, everything works correctly.

Regards,
Martin
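
To illustrate the point above, here is a minimal sketch in Python 3 (the thread itself uses Python 2.3, so this is a modernized illustration, not the poster's code). The sample response and the namespace URL `http://example.invalid/ns` are made-up stand-ins, because the real URL is elided in the posts. It shows that the parser sees a single text node under <string>, that the node's data is already unescaped, and that toxml() escapes it again when printing.

```python
from xml.dom import minidom, Node

# Hypothetical stand-in for the service response (the real URL is elided in the thread).
RESPONSE = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<string xmlns="http://example.invalid/ns">'
    b'&lt;DataSet&gt;&lt;Order&gt;&lt;Customer&gt;439&lt;/Customer&gt;'
    b'&lt;/Order&gt;&lt;/DataSet&gt;</string>'
)

doc = minidom.parseString(RESPONSE)
string_node = doc.documentElement

# The only child of <string> is a text node, not a <DataSet> element.
child_types = [child.nodeType for child in string_node.childNodes]
is_single_text = child_types == [Node.TEXT_NODE]

# The parser has already decoded the entities in the text node's data ...
inner_text = string_node.firstChild.data
# ... but toxml() re-escapes them when serializing, which is what print showed.
serialized = string_node.firstChild.toxml()

print(is_single_text)
print(inner_text)
print(serialized)
```

So the escaped form seen in the printed output comes from serialization; the text held in memory already contains literal "<" and ">" characters.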


Follow-up from Luis P. Mendes:

This is the XML document:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www......">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;
(... others ...)
~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>

When I do:

print xmldoc.toxml()

it prints:
<?xml version="1.0" ?>
<string xmlns="http://www...">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;

~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>

__________________________________________________ ________
With:

stringNode = xmldoc.childNodes[0]
print stringNode.toxml()

I get:
<string xmlns="http://www.......">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;

~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>
__________________________________________________ ____________________

With:

DataSetNode = stringNode.childNodes[0]
print DataSetNode.toxml()

I get:

&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;

~ &lt;/Order&gt;
&lt;/DataSet&gt;
__________________________________________________ _____________-

So far so good, but when I issue the command:

print DataSetNode.childNodes[0]

I get:
IndexError: tuple index out of range

Why the error, and why does it mention a tuple?
Why doesn't it return:
&lt;Order&gt;
&lt;Customer&gt;439&lt;/Customer&gt;

&lt;/Order&gt;
??


Luis P. Mendes wrote:
> This is the XML document:
>
> <?xml version="1.0" encoding="utf-8"?>
> <string xmlns="http://www......">&lt;DataSet&gt;
> ~ &lt;Order&gt;
> ~ &lt;Customer&gt;439&lt;/Customer&gt;
> (... others ...)
> ~ &lt;/Order&gt;
> &lt;/DataSet&gt;</string>







This is an XML document containing a single tag, <string>, whose content is text containing
entity-escaped XML.

This is *not* an XML document containing tags <DataSet>, <Order>, <Customer>, etc.

All the behaviour you are seeing is a consequence of this. You need to unescape the contents of the
<string> tag to be able to treat it as structured XML.

Kent
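
One way to do the unescaping described above is to let the parser do it: the text inside <string> is already unescaped once the outer document is parsed, so it can simply be parsed a second time. The following is a minimal sketch in Python 3, not the code from the thread. The sample response, the namespace URL `http://example.invalid/ns`, the second order (512) and the helper names `element_children` and `orders_from_response` are all hypothetical, introduced only for illustration.

```python
from xml.dom import minidom, Node

# Hypothetical response modeled on the thread; URL and second order are made up.
RESPONSE = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b'<string xmlns="http://example.invalid/ns">&lt;DataSet&gt;\n'
    b'  &lt;Order&gt;\n'
    b'    &lt;Customer&gt;439&lt;/Customer&gt;\n'
    b'  &lt;/Order&gt;\n'
    b'  &lt;Order&gt;\n'
    b'    &lt;Customer&gt;512&lt;/Customer&gt;\n'
    b'  &lt;/Order&gt;\n'
    b'&lt;/DataSet&gt;</string>'
)


def element_children(node):
    """Return only the element children, skipping whitespace text nodes."""
    return [c for c in node.childNodes if c.nodeType == Node.ELEMENT_NODE]


def text_of(node):
    """Concatenate the data of the direct text children of a node."""
    return "".join(c.data for c in node.childNodes if c.nodeType == Node.TEXT_NODE)


def orders_from_response(raw):
    """Parse the outer document, then parse the text inside <string> again."""
    outer = minidom.parseString(raw)
    # The parser has already turned &lt; and &gt; into < and > in the text data.
    inner_xml = text_of(outer.documentElement)
    inner = minidom.parseString(inner_xml)
    orders = []
    for order in element_children(inner.documentElement):
        # One dictionary per <Order>, mapping each field's tag name to its text.
        orders.append({f.tagName: text_of(f) for f in element_children(order)})
    return orders


# The text node under <string> has no children, so indexing childNodes fails.
text_node = minidom.parseString(RESPONSE).documentElement.firstChild
text_node_child_count = len(text_node.childNodes)

orders = orders_from_response(RESPONSE)
print(orders)
```

This also accounts for the IndexError in the follow-up post: the node being indexed is a text node, and a text node's childNodes is an empty tuple-like sequence. Filtering on ELEMENT_NODE replaces the length-based trick for skipping the "\n" text nodes.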

