XML Parsing with Python and minidom
Problem description
I'm using Python (minidom) to parse an XML file and print a hierarchical structure that looks something like this (indentation is used here to show the significant hierarchical relationship):
My Document
Overview
    Basic Features
    About This Software
        Platforms Supported
Instead, the program iterates over the nodes multiple times and prints duplicate nodes, as shown below. (Looking at the node list at each iteration, it's obvious why it does this, but I can't seem to find a way to get the node list I'm looking for.)
My Document
Overview
Basic Features
About This Software
Platforms Supported
Basic Features
About This Software
Platforms Supported
Platforms Supported
Here is the XML source file:
<?xml version="1.0" encoding="UTF-8"?>
<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>
Here is the Python program:
import xml.dom.minidom

dom = xml.dom.minidom.parse("test.xml")
Topic = dom.getElementsByTagName('Topic')
for node in Topic:
    # Collects every Title beneath this Topic, including those of nested Topics
    alist = node.getElementsByTagName('Title')
    for a in alist:
        Title = a.firstChild.data
        print(Title)
I could fix the problem by not nesting 'Topic' elements and renaming the lower-level topics to something like 'SubTopic1' and 'SubTopic2'. But I want to take advantage of XML's built-in hierarchical structure without needing different element names; it seems that I should be able to nest 'Topic' elements and that there should be some way to know which level of 'Topic' I'm currently looking at.
I've tried a number of different XPath functions without much success.
Recommended answer
getElementsByTagName is recursive: you'll get all descendants with a matching tagName, not just direct children. Because your Topics contain other Topics that also have Titles, the call returns the lower-down Titles once for every Topic that encloses them.
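A quick way to see this behavior, as a minimal sketch: the XML from the question is inlined as a string here (the XML declaration is left out and the names `XML`, `overview`, and `titles` are only for illustration), and the 'Overview' Topic is asked for its Titles.

```python
import xml.dom.minidom

# The document from the question, inlined so the sketch is self-contained.
XML = """<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>"""

dom = xml.dom.minidom.parseString(XML)

# Second Topic in document order is 'Overview'.
overview = dom.getElementsByTagName('Topic')[1]

# getElementsByTagName searches the whole subtree, so the Titles of the
# nested Topics come back along with the Topic's own Title.
titles = [t.firstChild.data for t in overview.getElementsByTagName('Title')]
print(titles)
```

The list contains four titles rather than one, which is where the repeated lines in the question's output come from.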
If you want to ask for matching direct children only, and you don't have XPath available, you can write a simple filter, e.g.:
def getChildrenByTagName(node, tagName):
    # Yield only the direct child elements of node that match tagName ('*' matches any element)
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE and (tagName == '*' or child.tagName == tagName):
            yield child

# dom is the document parsed in the question
for topic in dom.getElementsByTagName('Topic'):
    title = next(getChildrenByTagName(topic, 'Title'))  # first direct Title child
    print(title.firstChild.data)
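The same direct-children filter can also be used to recover the nesting level the question asks about. The following is only a sketch of one possible approach, not part of the original answer: the helper names `child_elements` and `outline` are hypothetical, the XML is inlined without its declaration, and the four-space indent is an arbitrary choice.

```python
import xml.dom.minidom

# The document from the question, inlined so the sketch is self-contained.
XML = """<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>"""


def child_elements(node, tag_name):
    """Yield only the direct child elements of node named tag_name."""
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.tagName == tag_name:
            yield child


def outline(node, depth=0):
    """Return (depth, title) pairs for every Topic below node, in document order."""
    rows = []
    for topic in child_elements(node, 'Topic'):
        title = next(child_elements(topic, 'Title'))
        rows.append((depth, title.firstChild.data))
        # Recurse into nested Topics one level deeper.
        rows.extend(outline(topic, depth + 1))
    return rows


dom = xml.dom.minidom.parseString(XML)
rows = outline(dom.documentElement)
for depth, text in rows:
    print('    ' * depth + text)
```

Because each Topic is visited only through its parent, every title appears once, and the recursion depth gives the level of the current Topic.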