xml.dom.minidom: Getting CDATA values
Question
I'm able to get the value in the image tag (see XML below), but not the Category tag. The difference is one is a CDATA section and the other is just a string. Any help would be appreciated.
from xml.dom import minidom
xml = """<?xml version="1.0" ?>
<ProductData>
<ITEM Id="0471195">
<Category>
<![CDATA[Homogenizers]]>
</Category>
<Image>
0471195.jpg
</Image>
</ITEM>
<ITEM Id="0471195">
<Category>
<![CDATA[Homogenizers]]>
</Category>
<Image>
0471196.jpg
</Image>
</ITEM>
</ProductData>
"""
bad_xml_item_count = 0
data = {}
xml_data = minidom.parseString(xml).getElementsByTagName('ProductData')
parts = xml_data[0].getElementsByTagName('ITEM')
for p in parts:
    try:
        part_id = p.attributes['Id'].value.strip()
    except KeyError:
        bad_xml_item_count += 1
        continue
    if not part_id:
        bad_xml_item_count += 1
        continue
    part_image = p.getElementsByTagName('Image')[0].firstChild.nodeValue.strip()
    part_category = p.getElementsByTagName('Category')[0].firstChild.data.strip()
    print('\t'.join([part_id, part_category, part_image]))
Recommended answer
p.getElementsByTagName('Category')[0].firstChild
minidom does not flatten <![CDATA[ sections into plain text; it leaves them as DOM CDATASection nodes. (Arguably it should flatten them, at least optionally. For what it's worth, DOM Level 3 LS flattens them by default, but minidom is much older than DOM L3.)
So the firstChild of Category is a Text node representing the whitespace between the <Category> open tag and the start of the CDATA section. It has two siblings: the CDATASection node, and another trailing whitespace Text node.
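To see those three children for yourself, a minimal sketch along these lines lists each child of Category with its node class. The shortened XML string and the variable names are illustrative assumptions, not part of the original question.

```python
from xml.dom import minidom

# Shortened, illustrative document (not the asker's full XML).
doc = minidom.parseString(
    "<ITEM><Category>\n<![CDATA[Homogenizers]]>\n</Category></ITEM>"
)
category = doc.getElementsByTagName("Category")[0]

# Collect (node class name, data) for every direct child of <Category>.
children = [(type(node).__name__, node.data) for node in category.childNodes]
for name, data in children:
    print(name, repr(data))
```

The first entry is the whitespace-only Text node, which is why firstChild.data.strip() comes back empty in the question's code.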
What you probably want is the textual data of all children of Category. In DOM Level 3 Core you'd just call:
p.getElementsByTagName('Category')[0].textContent
but minidom doesn't support that yet. Recent versions do, however, support another Level 3 method you can use to do the same thing in a more roundabout way:
p.getElementsByTagName('Category')[0].firstChild.wholeText
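Applied to one ITEM from the question, this could look like the sketch below. The helper `text_of` is a hypothetical name introduced here for illustration; it joins the data of every direct Text and CDATASection child, as a rough stand-in for textContent on elements that contain no nested elements.

```python
from xml.dom import minidom, Node

xml = """<?xml version="1.0" ?>
<ProductData>
<ITEM Id="0471195">
<Category>
<![CDATA[Homogenizers]]>
</Category>
<Image>
0471195.jpg
</Image>
</ITEM>
</ProductData>
"""


def text_of(element):
    """Hypothetical helper: join the data of all direct Text/CDATASection children."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    )


item = minidom.parseString(xml).getElementsByTagName("ITEM")[0]
category = item.getElementsByTagName("Category")[0]

# wholeText spans the logically adjacent Text and CDATASection siblings.
via_whole_text = category.firstChild.wholeText.strip()
# The helper reaches the same text without assuming firstChild exists.
via_helper = text_of(category).strip()

print(via_whole_text, via_helper)
```

Note that firstChild is None for an empty element, so the wholeText form needs a guard in that case, whereas the helper simply returns an empty string.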