CDATA getting stripped in lxml even after using strip_cdata=False

Problem description

I need to read an XML file and replace a string with a certain value. The XML contains a CDATA section, and I need to preserve it. I tried creating a parser with strip_cdata set to False, but this does not work, and I need help figuring out a way to achieve it.

import lxml.etree as ET

parser1 = ET.XMLParser(strip_cdata=False)

with open('testxml.xml', encoding="utf8") as f:
    tree = ET.parse(f, parser=parser1)

root = tree.getroot()
for elem in root.iter():  # iter() replaces the deprecated getiterator()
    try:
        elem.text = elem.text.replace('Bundled Manager 2.2(8b)', '123456')
    except AttributeError:
        pass

tree.write('output_new8.xml', xml_declaration=True, method='xml',  encoding="utf8")

Below is the sample XML:

<?xml version="1.0" encoding="UTF-8" standalone="no"?><!-- Copyright   (c) 2015 Moto Company, LLC. All rights reserved. Moto Confidential/Proprietary Information -->
<Benchmark>
       <status date="2013-03-11">draft</status>
    <title>Logitech TMM block(TM) System 300 Release Certification Matrix</title>
    <description>Random discription</description>
    <version time="2013-03-05T15:20:20.995-04:00" update="">3.0.0-2017.03.00</version>
        <model system="urn:xccdf:scoring:default"/>
    <Profile id="xccdf_com.Moto_profile_release_4.0.21">
        <status date="2016-03-30">draft</status>
        <title>RCM 4.0.21</title>
        <description><![CDATA[<p>Moto Vblock System 300 Release 4.0.21</p>
<ul><li> TMM VNX OE for File was updated to 7.1.79.8.</li>
</ul>]]>
</description>
        <set-value idref="xccdf_com.Moto_value_vision_content_version">3.0.0-2015.07.00</set-value>
        <set-value idref="xccdf_com.Moto_value_vision_version">3.0.0</set-value>
        <set-value idref="xccdf_com.Moto_value_vplex_version">5.3.0.03.00.04</set-value>
        <set-value idref="xccdf_com.Moto_value_powerpath_version">Bundled Manager 2.2(8b)</set-value>       
        <select idref="xccdf_com.Moto_rule_vnx_version" selected="true"/>
        <select idref="xccdf_com.Moto_rule_vplex_version" selected="true"/>
    </Profile>
</Benchmark>

The output of the code looks like this:

<?xml version='1.0' encoding='UTF8'?>
<!-- Copyright (c) 2015 Moto Company, LLC. All rights reserved. Moto Confidential/Proprietary Information --><Benchmark>
    <status date="2013-03-11">draft</status>
    <title>Logitech TMM block(TM) System 300 Release Certification Matrix</title>
    <description>Random discription</description>
    <version time="2013-03-05T15:20:20.995-04:00" update="">3.0.0-2017.03.00</version>
        <model system="urn:xccdf:scoring:default"/>
    <Profile id="xccdf_com.Moto_profile_release_4.0.21">
        <status date="2016-03-30">draft</status>
        <title>RCM 4.0.21</title>
        <description>&lt;p&gt;Moto Vblock System 300 Release 4.0.21&lt;/p&gt;
&lt;ul&gt;&lt;li&gt; TMM VNX OE for File was updated to 7.1.79.8.&lt;/li&gt;
&lt;/ul&gt;
</description>
        <set-value idref="xccdf_com.Moto_value_vision_content_version">3.0.0-2015.07.00</set-value>
        <set-value idref="xccdf_com.Moto_value_vision_version">3.0.0</set-value>
        <set-value idref="xccdf_com.Moto_value_vplex_version">5.3.0.03.00.04</set-value>
        <set-value idref="xccdf_com.Moto_value_powerpath_version">123456</set-value>        
        <select idref="xccdf_com.Moto_rule_vnx_version" selected="true"/>
        <select idref="xccdf_com.Moto_rule_vplex_version" selected="true"/>
    </Profile>
</Benchmark>

As you can see, the CDATA section is stripped. It would be great if someone could help me here.

Recommended answer

This is because you are doing

elem.text = elem.text.replace('Bundled Manager 2.2(8b)', '123456')

which replaces the CDATA with a normal text node.
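
To make the cause concrete, here is a minimal sketch using a small in-memory document (the element name and content are made up for illustration). It shows that even a replace() call that changes nothing still assigns a plain str back to .text, and that assignment is what drops the CDATA section:

```python
import lxml.etree as ET

# Parse with strip_cdata=False so the CDATA section survives parsing.
parser = ET.XMLParser(strip_cdata=False)
root = ET.fromstring(
    '<description><![CDATA[<p>hi</p>]]></description>', parser
)

# Serialized before touching .text: the CDATA section is still there.
before = ET.tostring(root, encoding='unicode')

# The replace() finds nothing, but the assignment still installs a
# plain text node in place of the CDATA node.
root.text = root.text.replace('no such string', 'x')

# Serialized after the assignment: the markup is now entity-escaped.
after = ET.tostring(root, encoding='unicode')

print(before)
print(after)
```

So strip_cdata=False is doing its job at parse time; the loss happens later, in the loop that reassigns every element's text.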

The documentation states:

Note how the .text property does not give any indication that the text content is wrapped by a CDATA section. If you want to make sure your data is wrapped by a CDATA block, you can use the CDATA() text wrapper.

Therefore, if you want to keep the CDATA section, you should assign to elem.text only when you are actually modifying it, and instruct lxml to use a CDATA section:

if elem.text and 'Bundled Manager 2.2(8b)' in elem.text:  # elem.text may be None
    elem.text = ET.CDATA(elem.text.replace('Bundled Manager 2.2(8b)', '123456'))
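
Put together, the approach could look like the following sketch. It runs against a trimmed, in-memory version of the sample XML rather than the original file, and the helper name replace_text and the shortened idref value are assumptions made for illustration. Note that, as in the snippet above, every element whose text is modified is written back as CDATA, even if it was plain text originally:

```python
import lxml.etree as ET

# Trimmed stand-in for the sample document (illustrative only).
SAMPLE = b"""<Profile>
  <description><![CDATA[<p>Release 4.0.21</p>]]></description>
  <set-value idref="powerpath_version">Bundled Manager 2.2(8b)</set-value>
</Profile>"""


def replace_text(root, old, new):
    """Replace old with new, touching only elements whose text contains old."""
    # iter(ET.Element) skips comments and processing instructions.
    for elem in root.iter(ET.Element):
        # elem.text is None for empty elements, so guard before searching.
        if elem.text and old in elem.text:
            # Wrap the new value in CDATA so it is serialized as a CDATA section.
            elem.text = ET.CDATA(elem.text.replace(old, new))


parser = ET.XMLParser(strip_cdata=False)
root = ET.fromstring(SAMPLE, parser)
replace_text(root, 'Bundled Manager 2.2(8b)', '123456')

result = ET.tostring(root, encoding='unicode')
print(result)
```

The description element is never assigned to, so its CDATA section is serialized unchanged, while the set-value element now carries its new value inside a CDATA section.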

Due to how the ElementTree API works (all text and CDATA content is concatenated and exposed as a str in the .text property), it is not really possible to know whether CDATA was originally used. (See "Figuring out where CDATA is in lxml element?" and the source code.)
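
A small sketch of that limitation, using made-up one-element documents: a CDATA section and the equivalent entity-escaped text produce the same .text value, and mixed text and CDATA content comes back as one concatenated string.

```python
import lxml.etree as ET

parser = ET.XMLParser(strip_cdata=False)

# Same content, once as a CDATA section and once as escaped text.
with_cdata = ET.fromstring('<d><![CDATA[<p>hi</p>]]></d>', parser)
escaped = ET.fromstring('<d>&lt;p&gt;hi&lt;/p&gt;</d>', parser)

# Plain text, a CDATA section, and more plain text in one element.
mixed = ET.fromstring('<d>a<![CDATA[<b>]]>c</d>', parser)

print(with_cdata.text == escaped.text)  # the two forms are indistinguishable
print(mixed.text)                       # text and CDATA are joined together
```

This is why the fix above decides on CDATA at write time instead of trying to detect it from .text.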
