xml.dom.minidom: Getting CDATA values

Views: 24. Published: 2021/10/1 19:07:06. Tags: python, xml

Question

I'm able to get the value in the Image tag (see the XML below), but not the value in the Category tag. The difference is that one is a CDATA section and the other is just a string. Any help would be appreciated.

from xml.dom import minidom

xml = """<?xml version="1.0" ?>
<ProductData>
    <ITEM Id="0471195">
        <Category>
            <![CDATA[Homogenizers]]>        
        </Category>
        <Image>
            0471195.jpg
        </Image>
    </ITEM>
    <ITEM Id="0471195">
        <Category>
            <![CDATA[Homogenizers]]>        
        </Category>
        <Image>
            0471196.jpg
        </Image>
    </ITEM>
</ProductData>
"""

bad_xml_item_count = 0
data = {}
xml_data = minidom.parseString(xml).getElementsByTagName('ProductData')
parts = xml_data[0].getElementsByTagName('ITEM')
for p in parts:
    try:
        part_id = p.attributes['Id'].value.strip()
    except KeyError:
        bad_xml_item_count += 1
        continue
    if not part_id:
        bad_xml_item_count += 1
        continue
    part_image = p.getElementsByTagName('Image')[0].firstChild.nodeValue.strip()
    part_category = p.getElementsByTagName('Category')[0].firstChild.data.strip()
    print('\t'.join([part_id, part_category, part_image]))

Recommended answer

p.getElementsByTagName('Category')[0].firstChild

minidom does not flatten <![CDATA[ sections into plain text; it leaves them as DOM CDATASection nodes. (Arguably it should flatten them, at least optionally. DOM Level 3 LS defaults to flattening them, for what it's worth, but minidom is much older than DOM L3.)

So the firstChild of Category is a Text node representing the whitespace between the <Category> open tag and the start of the CDATA section. Calling .data.strip() on it therefore yields an empty string. It has two siblings: the CDATASection node, and another Text node holding the trailing whitespace.
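
To see that node layout for yourself, here is a minimal sketch. It assumes a cut-down, one-element document that mirrors the <Category> markup from the question; the variable names are only illustrative.

```python
from xml.dom import minidom

# Hypothetical one-element document mirroring the <Category> markup in the question.
doc = minidom.parseString(
    "<Category>\n    <![CDATA[Homogenizers]]>    \n</Category>"
)
category = doc.documentElement

# Collect (class name, data) for each direct child of <Category>.
children = [(type(node).__name__, node.data) for node in category.childNodes]
for class_name, data in children:
    print(class_name, repr(data))

# The first child is whitespace only, so stripping it leaves nothing.
first_child_text = category.firstChild.data.strip()
```

The middle child is the CDATASection that carries the actual value; the two Text nodes around it contain only indentation and newlines.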

What you probably want is the textual data of all children of Category. In DOM Level 3 Core you'd just call:

p.getElementsByTagName('Category')[0].textContent

but minidom doesn't support that yet. Recent versions do, however, support another Level 3 property, wholeText, which you can use to do the same thing in a more roundabout way:

p.getElementsByTagName('Category')[0].firstChild.wholeText
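
As a rough sketch of how this plays out, the snippet below reads the category through wholeText and, as an alternative, through a small helper that joins the data of the element's direct Text and CDATASection children. The helper direct_text is hypothetical (it is not part of minidom), and the sample document is a shortened version of the one in the question.

```python
from xml.dom import Node, minidom


def direct_text(element):
    """Join the data of an element's direct Text and CDATASection children."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    )


# Shortened, hypothetical version of the document from the question.
doc = minidom.parseString(
    '<ITEM Id="0471195"><Category>\n  <![CDATA[Homogenizers]]>  \n</Category></ITEM>'
)
category = doc.getElementsByTagName("Category")[0]

# wholeText concatenates the first Text node with its adjacent text/CDATA siblings.
via_whole_text = category.firstChild.wholeText.strip()

# The helper does the same join explicitly and does not need firstChild to be a Text node.
via_helper = direct_text(category).strip()

print(via_whole_text, via_helper)
```

Note that wholeText only exists on Text (and CDATASection) nodes, so the firstChild.wholeText form assumes the element's first child is one of those; the helper sidesteps that assumption.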
