xml parsing escape characters

Views: 73 Published: 2019/6/5 7:35:45 python

Problem description

Hi,

I only know a little bit of XML, and I'm trying to parse an XML document
in order to save its elements in a file (dictionaries inside a list).

When I access a URL from Python 2.3.3 running on Linux with the
following lines:

import urllib
from xml.dom import minidom

resposta = urllib.urlopen(url)
xmldoc = minidom.parse(resposta)
resposta.close()

I get the following result:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www......">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;
(... others ...)
~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>
__________________________________________________ ___________

In the lines below, I try to get all the child nodes from string, first
by counting them, and then ignoring the \n ones:

stringNode = xmldoc.childNodes[0]
print stringNode.toxml()
dataSetNode = stringNode.childNodes[0]
numNos = len(dataSetNode.childNodes)
todosNos = {}
for no in range(numNos):
    todosNos[no] = dataSetNode.childNodes[no].toxml()
# keep only the positions whose serialized node is longer than 4 characters,
# i.e. skip the newline-only text nodes
posicaoXml = [no for no in todosNos.keys() if len(todosNos[no]) > 4]
print posicaoXml

(I'm almost sure there's a simpler way to do this...)
__________________________________________________ ___________

I don't get any elements. But if I access the same URL via a browser,
the result in the browser window is something like:

<string xmlns="http://www......">
~ <DataSet>
~ <Order>
~ <Customer>439</Customer>
(... others ...)
~ </Order>
~ </DataSet>
</string>

and the lines I posted work as intended.

I already browsed the web, and I know it's about the escape characters, but
I didn't find a simple solution for this.

I tried to use LL2XML.py and an unescape function with a simple replace

text = text.replace("&lt;", "<")

but I had to convert the XML document to a string, and then I could not (or
don't know how to) convert it back to an XML object.

How can I solve this? Please explain it keeping in mind that I'm just
beginning with XML and I'm not very experienced in Python, either.
Luis
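
On the aside about a simpler way to skip the newline-only nodes: one possibility is to filter children by node type instead of by the length of their serialized form. The following is only a minimal sketch, written in Python 3 syntax rather than the Python 2.3 used in the post. SAMPLE and element_children are hypothetical names introduced for illustration, and the sample assumes real (unescaped) markup like the browser view shows.

```python
from xml.dom import minidom

# Hypothetical sample with real markup, modelled on the browser view above.
SAMPLE = """<DataSet>
  <Order>
    <Customer>439</Customer>
  </Order>
</DataSet>"""


def element_children(node):
    """Return only the element children, skipping whitespace text nodes."""
    return [child for child in node.childNodes
            if child.nodeType == child.ELEMENT_NODE]


doc = minidom.parseString(SAMPLE)
data_set = doc.documentElement

# childNodes also contains the newline/indentation text nodes ...
print(len(data_set.childNodes))

# ... while the filtered list contains only the <Order> elements.
orders = element_children(data_set)
print([order.tagName for order in orders])
```

Filtering on nodeType avoids guessing at a length threshold such as 4, which would also drop very short but legitimate elements.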

Recommended answer

Luis P. Mendes wrote:

> I get the following result:
>
> <?xml version="1.0" encoding="utf-8"?>
> <string xmlns="http://www......">&lt;DataSet&gt;
> ~ &lt;Order&gt;

Most likely, this result is correct, and your document
really does contain

&lt;Order&gt;

> I don't get any elements. But if I access the same URL via a browser,
> the result in the browser window is something like:
>
> <string xmlns="http://www......">
> ~ <DataSet>

Most likely, your browser is incorrect (or at least confusing), and
renders &lt; as "<", even though this is not markup.

> I already browsed the web, I know it's about the escape characters, but
> I didn't find a simple solution for this.

Not sure what "this" is. AFAICT, everything works correctly.

Regards,
Martin
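
The point of this reply can be checked directly: when the content of <string> is entity-escaped, the parser hands back a single text node, not elements. The snippet below is a rough sketch in Python 3 syntax. RAW is a hypothetical stand-in for the service response, and its namespace URL is made up because the real one is elided in the thread.

```python
from xml.dom import minidom

# Hypothetical stand-in for the response; the namespace URL is invented.
RAW = (b'<?xml version="1.0" encoding="utf-8"?>'
       b'<string xmlns="http://example.invalid/ns">'
       b'&lt;DataSet&gt;&lt;Order&gt;&lt;Customer&gt;439'
       b'&lt;/Customer&gt;&lt;/Order&gt;&lt;/DataSet&gt;</string>')

doc = minidom.parseString(RAW)
string_node = doc.documentElement
child = string_node.firstChild

# <string> has exactly one child, and that child is plain text.
print(len(string_node.childNodes))
print(child.nodeType == child.TEXT_NODE)

# The parser has already resolved the entities inside the text ...
print(child.data)

# ... and toxml() escapes them again when serializing.
print(string_node.toxml())
```

Seen this way, the output of toxml() in the question is consistent with the document: the angle brackets are character data, so they are written back out as entities.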

This is the XML document:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www......">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;
(... others ...)
~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>

When I do:

print xmldoc.toxml()

it prints:
<?xml version="1.0" ?>
<string xmlns="http://www...">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;

~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>

__________________________________________________ ________
With:

stringNode = xmldoc.childNodes[0]
print stringNode.toxml()

I get:

<string xmlns="http://www.......">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;

~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>
__________________________________________________ ____________________

With:

DataSetNode = stringNode.childNodes[0]
print DataSetNode.toxml()

I get:

&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;

~ &lt;/Order&gt;
&lt;/DataSet&gt;
__________________________________________________ _____________-

So far so good, but when I issue the command:

print DataSetNode.childNodes[0]

I get:
IndexError: tuple index out of range

Why the error, and why does it return a tuple?
Why doesn't it return:
&lt;Order&gt;
&lt;Customer&gt;439&lt;/Customer&gt;

&lt;/Order&gt;
??
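
A note on the tuple: the node stored in DataSetNode is a text node, and in CPython's minidom a text node can have no children, so its childNodes is an empty, tuple-like sequence. Indexing position 0 of that empty sequence is what raises the IndexError. The sketch below reproduces this in Python 3 syntax with a small made-up document; the variable names are illustrative only.

```python
from xml.dom import minidom

# Small made-up document whose only content is entity-escaped text.
doc = minidom.parseString(
    b'<string>&lt;DataSet&gt;&lt;/DataSet&gt;</string>')
text_node = doc.documentElement.firstChild

print(text_node.nodeType == text_node.TEXT_NODE)
print(len(text_node.childNodes))  # a text node has no children

error = None
try:
    text_node.childNodes[0]  # same access as DataSetNode.childNodes[0]
except IndexError as exc:
    error = exc
    print("IndexError:", exc)
```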

Luis P. Mendes wrote:

> this is the xml document:
>
> <?xml version="1.0" encoding="utf-8"?>
> <string xmlns="http://www......">&lt;DataSet&gt;
> ~ &lt;Order&gt;
> ~ &lt;Customer&gt;439&lt;/Customer&gt;
> (... others ...)
> ~ &lt;/Order&gt;
> &lt;/DataSet&gt;</string>

This is an XML document containing a single tag, <string>, whose content is text containing
entity-escaped XML.

This is *not* an XML document containing tags <DataSet>, <Order>, <Customer>, etc.

All the behaviour you are seeing is a consequence of this. You need to unescape the contents of the
<string> tag to be able to treat it as structured XML.

Kent
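
One way to act on this advice, sketched under some assumptions: parse the outer document, take the text content of <string> (the parser has already resolved the entities in it), and parse that text as a second XML document. The code is Python 3 syntax rather than the Python 2.3 in the thread. RAW, text_of and elements are hypothetical names, and the namespace URL and the second order are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical stand-in for the service response.
RAW = b"""<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://example.invalid/ns">&lt;DataSet&gt;
  &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;
  &lt;/Order&gt;
  &lt;Order&gt;
    &lt;Customer&gt;512&lt;/Customer&gt;
  &lt;/Order&gt;
&lt;/DataSet&gt;</string>"""


def text_of(node):
    """Concatenate the text nodes directly under `node`."""
    return "".join(child.data for child in node.childNodes
                   if child.nodeType == child.TEXT_NODE)


def elements(node):
    """Return the element children of `node`, skipping text nodes."""
    return [child for child in node.childNodes
            if child.nodeType == child.ELEMENT_NODE]


# Step 1: parse the outer document; <string> holds one block of text.
outer = minidom.parseString(RAW)
inner_xml = text_of(outer.documentElement)

# Step 2: that text is itself XML, so parse it as a second document.
inner = minidom.parseString(inner_xml)

# Step 3: build the "dictionaries inside a list" structure from the orders.
orders = []
for order in elements(inner.documentElement):
    orders.append({field.tagName: text_of(field)
                   for field in elements(order)})

print(orders)
```

No manual replace of "&lt;" is needed in this sketch, because reading the text node's data already yields the unescaped characters; minidom.parseString is what turns the resulting string back into a DOM object.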
