lxml/python reading xml with CDATA section
Question
In my xml I have a CDATA section. I want to keep the CDATA part, and then strip it. Can someone help with the following?
The default settings have no effect:
$ from io import StringIO
$ from lxml import etree
$ xml = '<Subject> My Subject: 美海軍研究船勘查台海水文? 船<![CDATA[é]]>€ </Subject>'
$ tree = etree.parse(StringIO(xml))
$ tree.getroot().text
' My Subject: 美海軍研究船勘查台海水文? 船é€ '
This post seems to suggest that the parser option strip_cdata=False may keep the CDATA, but it has no effect:
$ parser=etree.XMLParser(strip_cdata=False)
$ tree = etree.parse(StringIO(xml), parser=parser)
$ tree.getroot().text
' My Subject: 美海軍研究船勘查台海水文? 船é€ '
Using strip_cdata=True, which should be the default, yields the same:
$ parser=etree.XMLParser(strip_cdata=True)
$ tree = etree.parse(StringIO(xml), parser=parser)
$ tree.getroot().text
' My Subject: 美海軍研究船勘查台海水文? 船é€ '
Recommended answer
CDATA sections are not preserved in the text property of an element, even if strip_cdata=False is used when the XML content is parsed, as you have noticed. See https://lxml.de/api.html#cdata.
CDATA sections are preserved when the content was parsed with strip_cdata=False:
- When serializing with tostring():
print(etree.tostring(tree.getroot(), encoding="UTF-8").decode())
- When writing to a file:
tree.write("subject.xml", encoding="UTF-8")
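To make the difference concrete, here is a minimal sketch that parses the same document with the default parser and with strip_cdata=False, then compares the text property and the serialized output. The sample string and the variable names (default_root, keeping_root, and so on) are made up for illustration and are not from the question.

```python
from lxml import etree

# Illustrative input (not the question's data): one CDATA section
# containing markup-like characters.
xml = "<Subject>before <![CDATA[<b>raw</b>]]> after</Subject>"

# Default parser: CDATA sections are merged into ordinary text.
default_root = etree.fromstring(xml)

# strip_cdata=False: CDATA sections are kept as such in the parsed tree.
keeping_parser = etree.XMLParser(strip_cdata=False)
keeping_root = etree.fromstring(xml, parser=keeping_parser)

# The text property is the same either way; it never contains the
# <![CDATA[ ... ]]> markers.
print(default_root.text == keeping_root.text)

# The difference only shows up when the tree is serialized.
default_out = etree.tostring(default_root, encoding="unicode")
keeping_out = etree.tostring(keeping_root, encoding="unicode")
print(default_out)
print(keeping_out)
```

In other words, strip_cdata=False affects what is written back out, not what the text property returns, so there is nothing to strip from the text property in the first place.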