xml.dom.minidom: Getting CDATA values

Question

I'm able to get the value in the image tag (see XML below), but not the Category tag. The difference is one is a CDATA section and the other is just a string. Any help would be appreciated.

from xml.dom import minidom

xml = """<?xml version="1.0" ?>
<ProductData>
    <ITEM Id="0471195">
        <Category>
            <![CDATA[Homogenizers]]>        
        </Category>
        <Image>
            0471195.jpg
        </Image>
    </ITEM>
    <ITEM Id="0471195">
        <Category>
            <![CDATA[Homogenizers]]>        
        </Category>
        <Image>
            0471196.jpg
        </Image>
    </ITEM>
</ProductData>
"""

bad_xml_item_count = 0
data = {}
xml_data = minidom.parseString(xml).getElementsByTagName('ProductData')
parts = xml_data[0].getElementsByTagName('ITEM')
for p in parts:
    try:
        part_id = p.attributes['Id'].value.strip()
    except(KeyError):
        bad_xml_item_count += 1
        continue
    if not part_id:
        bad_xml_item_count += 1
        continue
    part_image = p.getElementsByTagName('Image')[0].firstChild.nodeValue.strip()
    part_category = p.getElementsByTagName('Category')[0].firstChild.data.strip()
    print('\t'.join([part_id, part_category, part_image]))

Recommended answer

p.getElementsByTagName('Category')[0].firstChild

minidom does not flatten <![CDATA[ sections into plain text; it leaves them as DOM CDATASection nodes. (Arguably it should, at least optionally. DOM Level 3 LS defaults to flattening them, for what it's worth, but minidom is much older than DOM L3.)

So the firstChild of Category is a Text node representing the whitespace between the <Category> open tag and the start of the CDATA section. It has two siblings: the CDATASection node, and another trailing whitespace Text node.
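To see those three siblings directly, here is a minimal sketch that parses a cut-down Category element (not the full document from the question) and lists each child's node type and data. The `names` mapping and the `children` list are illustrative names, not part of minidom.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# A reduced version of the Category element from the question.
doc = minidom.parseString(
    "<Category>\n    <![CDATA[Homogenizers]]>    \n</Category>"
)
category = doc.documentElement

# Map the numeric nodeType constants to readable names (illustrative only).
names = {Node.TEXT_NODE: "Text", Node.CDATA_SECTION_NODE: "CDATASection"}

# Collect (kind, data) for every direct child of <Category>.
children = [(names[child.nodeType], child.data) for child in category.childNodes]
for kind, data in children:
    print(kind, repr(data))
```

The first and last entries should be whitespace-only Text nodes, which is why calling .strip() on firstChild's data yields an empty string.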

What you probably want is the textual data of all children of Category. In DOM Level 3 Core you'd just call:

p.getElementsByTagName('Category')[0].textContent

but minidom doesn't support that yet. Recent versions do, however, support another Level 3 method you can use to do the same thing in a more roundabout way:

p.getElementsByTagName('Category')[0].firstChild.wholeText
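Applied to the question's data, that could look like the sketch below. It uses a shortened copy of the XML, and `element_text` is a hypothetical helper (not a minidom API) shown as an alternative for minidom versions that lack wholeText; it simply joins the data of the element's direct Text and CDATASection children. In both cases the surrounding whitespace still has to be stripped.

```python
from xml.dom import minidom

# Shortened copy of the XML from the question (one ITEM only).
xml = """<ProductData>
    <ITEM Id="0471195">
        <Category>
            <![CDATA[Homogenizers]]>
        </Category>
        <Image>
            0471195.jpg
        </Image>
    </ITEM>
</ProductData>"""


def element_text(element):
    """Hypothetical helper: concatenate direct Text and CDATASection children."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    ).strip()


item = minidom.parseString(xml).getElementsByTagName("ITEM")[0]
category = item.getElementsByTagName("Category")[0]

# wholeText joins the node with its adjacent Text/CDATASection siblings.
via_whole_text = category.firstChild.wholeText.strip()
via_helper = element_text(category)
print(via_whole_text, via_helper)
```

Note that the helper only looks at direct children, so it would miss text nested inside child elements; for the flat Category and Image elements here that is sufficient.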
