CDATA getting stripped in lxml even after using strip_cdata=False
Question
I need to read an XML file and replace a string with a certain value. The XML contains a CDATA section that I need to preserve. I tried creating a parser with strip_cdata set to False, but it does not work, and I need help finding a way to achieve this.
import lxml.etree as ET

parser1 = ET.XMLParser(strip_cdata=False)
with open('testxml.xml', encoding="utf8") as f:
    tree = ET.parse(f, parser=parser1)
    root = tree.getroot()
    for elem in root.getiterator():
        try:
            elem.text = elem.text.replace('Bundled Manager 2.2(8b)', '123456')
        except AttributeError:
            pass
tree.write('output_new8.xml', xml_declaration=True, method='xml', encoding="utf8")
Below is the sample XML:
<?xml version="1.0" encoding="UTF-8" standalone="no"?><!-- Copyright (c) 2015 Moto Company, LLC. All rights reserved. Moto Confidential/Proprietary Information -->
<Benchmark>
<status date="2013-03-11">draft</status>
<title>Logitech TMM block(TM) System 300 Release Certification Matrix</title>
<description>Random discription</description>
<version time="2013-03-05T15:20:20.995-04:00" update="">3.0.0-2017.03.00</version>
<model system="urn:xccdf:scoring:default"/>
<Profile id="xccdf_com.Moto_profile_release_4.0.21">
<status date="2016-03-30">draft</status>
<title>RCM 4.0.21</title>
<description><![CDATA[<p>Moto Vblock System 300 Release 4.0.21</p>
<ul><li> TMM VNX OE for File was updated to 7.1.79.8.</li>
</ul>]]>
</description>
<set-value idref="xccdf_com.Moto_value_vision_content_version">3.0.0-2015.07.00</set-value>
<set-value idref="xccdf_com.Moto_value_vision_version">3.0.0</set-value>
<set-value idref="xccdf_com.Moto_value_vplex_version">5.3.0.03.00.04</set-value>
<set-value idref="xccdf_com.Moto_value_powerpath_version">Bundled Manager 2.2(8b)</set-value>
<select idref="xccdf_com.Moto_rule_vnx_version" selected="true"/>
<select idref="xccdf_com.Moto_rule_vplex_version" selected="true"/>
</Profile>
</Benchmark>
The output of the code looks like this:
<?xml version='1.0' encoding='UTF8'?>
<!-- Copyright (c) 2015 Moto Company, LLC. All rights reserved. Moto Confidential/Proprietary Information --><Benchmark>
<status date="2013-03-11">draft</status>
<title>Logitech TMM block(TM) System 300 Release Certification Matrix</title>
<description>Random discription</description>
<version time="2013-03-05T15:20:20.995-04:00" update="">3.0.0-2017.03.00</version>
<model system="urn:xccdf:scoring:default"/>
<Profile id="xccdf_com.Moto_profile_release_4.0.21">
<status date="2016-03-30">draft</status>
<title>RCM 4.0.21</title>
<description><p>Moto Vblock System 300 Release 4.0.21</p>
<ul><li> TMM VNX OE for File was updated to 7.1.79.8.</li>
</ul>
</description>
<set-value idref="xccdf_com.Moto_value_vision_content_version">3.0.0-2015.07.00</set-value>
<set-value idref="xccdf_com.Moto_value_vision_version">3.0.0</set-value>
<set-value idref="xccdf_com.Moto_value_vplex_version">5.3.0.03.00.04</set-value>
<set-value idref="xccdf_com.Moto_value_powerpath_version">123456</set-value>
<select idref="xccdf_com.Moto_rule_vnx_version" selected="true"/>
<select idref="xccdf_com.Moto_rule_vplex_version" selected="true"/>
</Profile>
</Benchmark>
As you can see, the CDATA section is stripped. It would be great if someone could help me here.
Recommended answer
This happens because you are doing

    elem.text = elem.text.replace('Bundled Manager 2.2(8b)', '123456')

for every element, which replaces the CDATA with a normal text node.
The documentation states:

"Note how the .text property does not give any indication that the text content is wrapped by a CDATA section. If you want to make sure your data is wrapped by a CDATA block, you can use the CDATA() text wrapper."
Therefore, if you want to keep the CDATA section, you should only assign to elem.text if you are modifying it, and instruct lxml to use a CDATA section:
    # elem.text is None for empty elements, so check it before searching
    if elem.text and 'Bundled Manager 2.2(8b)' in elem.text:
        elem.text = ET.CDATA(elem.text.replace('Bundled Manager 2.2(8b)', '123456'))
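Put together, the approach above might look like the following sketch. The XML here is a trimmed-down stand-in for the sample in the question, parsed from memory instead of from testxml.xml, and the helper name replace_text is hypothetical, chosen only for illustration.

```python
import lxml.etree as ET

# Trimmed-down stand-in for the sample XML in the question (an assumption).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<Profile>
<description><![CDATA[<p>Release 4.0.21</p>]]></description>
<set-value idref="powerpath_version">Bundled Manager 2.2(8b)</set-value>
</Profile>"""


def replace_text(xml_bytes, old, new):
    """Hypothetical helper: replace `old` with `new` in element text only
    where it occurs, leaving every other element untouched."""
    parser = ET.XMLParser(strip_cdata=False)  # keep CDATA sections in the tree
    root = ET.fromstring(xml_bytes, parser)
    # iter(ET.Element) visits elements only, skipping comments and PIs
    for elem in root.iter(ET.Element):
        if elem.text and old in elem.text:
            # Assign only when something changes, and ask for a CDATA section
            elem.text = ET.CDATA(elem.text.replace(old, new))
    return ET.tostring(root, encoding="unicode")


result = replace_text(SAMPLE, "Bundled Manager 2.2(8b)", "123456")
print(result)
```

Because the description element is never assigned to, its CDATA section is serialized unchanged. The replaced value is written as a CDATA section as well; if it should remain a plain text node, assign the plain string instead of ET.CDATA(...).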
Due to how the ElementTree library works (the entire text and CDATA content is concatenated and exposed as a str in the .text property), it's not really possible to know whether CDATA was originally used or not. (See "Figuring out where CDATA is in lxml element?" and the source code.)
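A small sketch of that limitation, using two made-up one-element documents: the .text property returns the same string whether or not the content was wrapped in CDATA, and the difference only shows up when the tree is serialized.

```python
import lxml.etree as ET

parser = ET.XMLParser(strip_cdata=False)

# Two illustrative documents with the same text content
with_cdata = ET.fromstring("<r><![CDATA[x]]></r>", parser)
plain = ET.fromstring("<r>x</r>", parser)

# .text gives no hint about the CDATA wrapper
same_text = with_cdata.text == plain.text

# Serialization is where the CDATA section becomes visible again
serialized_cdata = ET.tostring(with_cdata)
serialized_plain = ET.tostring(plain)
print(same_text, serialized_cdata, serialized_plain)
```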