xml parsing escape characters
Question
Hi,
I only know a little bit of XML and I'm trying to parse an XML document
in order to save its elements in a file (dictionaries inside a list).
When I access a URL from Python 2.3.3 running on Linux with the
following lines:
resposta = urllib.urlopen(url)
xmldoc = minidom.parse(resposta)
resposta.close()
I get the following result:
<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www......">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;
(... others ...)
~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>
__________________________________________________ ___________
In the lines below, I try to get all the child nodes from string, first
by counting them, and then ignoring the \n ones:
stringNode = xmldoc.childNodes[0]
print stringNode.toxml()
dataSetNode = stringNode.childNodes[0]
numNos = len(dataSetNode.childNodes)
todosNos={}
for no in range(numNos):
todosNos[no] = dataSetNode.childNodes[no].toxml()
posicaoXml = [no for no in todosNos.keys() if len(todosNos[no])>4]
print posicaoXml
(I'm almost sure there's a simpler way to do this...)
__________________________________________________ ___________
I don't get any elements. But if I access the same URL via a browser,
the result in the browser window is something like:
<string xmlns="http://www......">
~ <DataSet>
~ <Order>
~ <Customer>439</Customer>
(... others ...)
~ </Order>
~ </DataSet>
</string>
and the lines I posted work as intended.
I already browsed the web and I know it's about the escape characters, but
I didn't find a simple solution for this.
I tried to use LL2XML.py and an unescape function with a simple replace
text = text.replace("&lt;", "<")
but I had to convert the XML document to a string, and then I could not (or
don't know how to) convert it back to an XML object.
How can I solve this? Please explain it bearing in mind that I'm just
beginning with XML and I'm not very experienced in Python either.
Luis
Recommended answer
Luis P. Mendes wrote:
I get the following result:
<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www......">&lt;DataSet&gt;
~ &lt;Order&gt;
Most likely, this result is correct, and your document
really does contain
&lt;Order&gt;
I don't get any elements. But if I access the same URL via a browser,
the result in the browser window is something like:
<string xmlns="http://www......">
~ <DataSet>
Most likely, your browser is incorrect (or at least confusing), and
renders &lt; as "<", even though this is not markup.
I already browsed the web and I know it's about the escape characters, but
I didn't find a simple solution for this.
Not sure what "this" is. AFAICT, everything works correctly.
Regards,
Martin
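One way to check Martin's reading is to inspect the node types directly. The following is a minimal sketch in current Python 3 (not the Python 2.3.3 of the thread); `SAMPLE` and its namespace URL are hypothetical stand-ins, since the real URL and payload are elided in the posts.

```python
from xml.dom import minidom, Node

# Hypothetical stand-in for the service response (the real one is elided in the thread).
SAMPLE = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<string xmlns="http://example.invalid/ns">'
    b'&lt;DataSet&gt;&lt;Order&gt;&lt;Customer&gt;439&lt;/Customer&gt;'
    b'&lt;/Order&gt;&lt;/DataSet&gt;</string>'
)

doc = minidom.parseString(SAMPLE)
string_node = doc.documentElement

# <string> has a single child, and that child is a text node, not an element.
children = string_node.childNodes
only_child = children[0]
is_text = only_child.nodeType == Node.TEXT_NODE

# The parser has already decoded the entities in the text node's data...
decoded = only_child.data
# ...while toxml() escapes them again on output, so printouts still show &lt;.
serialized = only_child.toxml()
# A text node has no children of its own.
grandchildren = len(only_child.childNodes)

print(len(children), is_text, grandchildren)
print(decoded)
print(serialized)
```

If this matches the real document, it would explain why no `DataSet` or `Order` elements show up: as far as the parser is concerned, they are just characters inside one piece of text.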
this is the xml document:
<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www......">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;
(... others ...)
~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>
When I do:
print xmldoc.toxml()
it prints:
<?xml version="1.0" ?>
<string xmlns="http://www...">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;
~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>
__________________________________________________ ________
with: stringNode = xmldoc.childNodes[0]
print stringNode.toxml()
I get:
<string xmlns="http://www.......">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;
~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>
__________________________________________________ ____________________
with: DataSetNode = stringNode.childNodes[0]
print DataSetNode.toxml()
I get:
&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;
~ &lt;/Order&gt;
&lt;/DataSet&gt;
__________________________________________________ _____________-
so far so good, but when I issue the command:
print DataSetNode.childNodes[0]
I get:
IndexError: tuple index out of range
Why the error, and why does it return a tuple?
Why doesn't it return:
&lt;Order&gt;
&lt;Customer&gt;439&lt;/Customer&gt;
&lt;/Order&gt;
??
Luis P. Mendes wrote:
this is the xml document:
<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www......">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;
(... others ...)
~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>
This is an XML document containing a single tag, <string>, whose content is text containing
entity-escaped XML.
This is *not* an XML document containing tags <DataSet>, <Order>, <Customer>, etc.
All the behaviour you are seeing is a consequence of this. You need to unescape the contents of the
<string> tag to be able to treat it as structured XML.
Kent
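Kent's suggestion could be sketched roughly as follows, in current Python 3 rather than the Python 2.3.3 used in the thread. Because the parser already decodes the entities when it reads the outer document, no manual `replace` should be needed: the text inside `<string>` can be handed to `minidom.parseString` as a second document. `SAMPLE`, its namespace URL, and the helper name `parse_orders` are assumptions made for illustration; the real payload is elided in the posts.

```python
from xml.dom import minidom, Node

# Hypothetical stand-in for the service response (the real one is elided in the thread).
SAMPLE = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b'<string xmlns="http://example.invalid/ns">&lt;DataSet&gt;\n'
    b'  &lt;Order&gt;\n'
    b'    &lt;Customer&gt;439&lt;/Customer&gt;\n'
    b'  &lt;/Order&gt;\n'
    b'&lt;/DataSet&gt;</string>'
)


def parse_orders(raw):
    """Return one dict per <Order>, mapping child tag names to their text."""
    outer = minidom.parseString(raw)
    # Join the text children of <string>; their data is already unescaped.
    inner_xml = "".join(
        child.data
        for child in outer.documentElement.childNodes
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    )
    # Parse the recovered text as a second, independent XML document.
    inner = minidom.parseString(inner_xml)

    orders = []
    for order in inner.getElementsByTagName("Order"):
        record = {}
        for field in order.childNodes:
            if field.nodeType != Node.ELEMENT_NODE:
                continue  # skip the whitespace-only text nodes between tags
            record[field.tagName] = "".join(
                t.data for t in field.childNodes if t.nodeType == Node.TEXT_NODE
            )
        orders.append(record)
    return orders


orders = parse_orders(SAMPLE)
print(orders)  # prints [{'Customer': '439'}]
```

Filtering on `nodeType` replaces the length check on `toxml()` from the question, and the result is already in the "dictionaries inside a list" shape the original post asks for.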