XML Parsing with Python and minidom

Question

I'm using Python (minidom) to parse an XML file and print a hierarchical structure that looks something like this (indentation is used here to show the significant hierarchical relationship):

My Document
Overview
    Basic Features
    About This Software
        Platforms Supported

Instead, the program iterates over the nodes multiple times and prints duplicate nodes, producing the output below. (Looking at the node list at each iteration, it's obvious why it does this, but I can't seem to find a way to get the node list I'm looking for.)

My Document
Overview
Basic Features
About This Software
Platforms Supported
Basic Features
About This Software
Platforms Supported
Platforms Supported

Here is the XML source file:

<?xml version="1.0" encoding="UTF-8"?>
<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>

Here is the Python program:

import xml.dom.minidom

dom = xml.dom.minidom.parse("test.xml")
Topic = dom.getElementsByTagName('Topic')
for node in Topic:
    alist = node.getElementsByTagName('Title')
    for a in alist:
        Title = a.firstChild.data
        print(Title)

I could fix the problem by not nesting 'Topic' elements and instead renaming the lower-level topics to something like 'SubTopic1' and 'SubTopic2'. But I want to take advantage of XML's built-in hierarchical structure without needing different element names; it seems that I should be able to nest 'Topic' elements and that there should be some way to know which level of 'Topic' I'm currently looking at.

I've tried a number of different XPath functions without much success.

Recommended answer

getElementsByTagName is recursive: you'll get all descendants with a matching tagName. Because your Topics contain other Topics that also have Titles, the call will return the lower-down Titles many times.
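
To make the duplication visible, here is a small sketch that lists, for each Topic, the Titles that getElementsByTagName finds beneath it. It embeds the XML from the question in a string (the name XML_SOURCE is just an illustration) and uses parseString so it runs without test.xml on disk:

```python
from xml.dom.minidom import parseString

# The XML from the question, inlined so the example is self-contained.
XML_SOURCE = """<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>"""

dom = parseString(XML_SOURCE)

# For every Topic (at any depth), collect the Title text of all descendants.
titles_per_topic = [
    [title.firstChild.data for title in topic.getElementsByTagName("Title")]
    for topic in dom.getElementsByTagName("Topic")
]

for titles in titles_per_topic:
    print(titles)
```

The 'Overview' Topic reports four Titles because its nested Topics are descendants too, and each nested Topic is then visited again by the outer loop.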

If you want only the matching direct children, and you don't have XPath available, you can write a simple filter, e.g.:

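import xml.dom.minidom

def getChildrenByTagName(node, tagName):
    # Yield only the direct child elements of node whose tag matches ('*' matches any tag).
    for child in node.childNodes:
        if child.nodeType==child.ELEMENT_NODE and (tagName=='*' or child.tagName==tagName):
            yield child

document = xml.dom.minidom.parse("test.xml")
for topic in document.getElementsByTagName('Topic'):
    # Take the first Title that is a direct child of this Topic.
    title = next(getChildrenByTagName(topic, 'Title'))
    print(title.firstChild.data)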
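
The loop above prints each title once but flat, without the indentation the question asks for. One possible way to recover the nesting level is to recurse over direct children only and pass the depth along. The following is a sketch under that assumption, not part of the original answer; child_elements, outline, and XML_SOURCE are hypothetical names, and the XML is inlined so the example runs on its own:

```python
from xml.dom.minidom import parseString

# The XML from the question, inlined so the example is self-contained.
XML_SOURCE = """<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>"""


def child_elements(node, tag_name):
    """Yield only the direct child elements of node named tag_name."""
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.tagName == tag_name:
            yield child


def outline(node, depth=0):
    """Return one indented line per Topic nested under node, in document order."""
    lines = []
    for topic in child_elements(node, "Topic"):
        title = next(child_elements(topic, "Title"), None)
        if title is not None and title.firstChild is not None:
            lines.append("    " * depth + title.firstChild.data)
        # Nested Topics are handled one level deeper.
        lines.extend(outline(topic, depth + 1))
    return lines


dom = parseString(XML_SOURCE)
lines = outline(dom.documentElement)
print("\n".join(lines))
```

Because each Topic is reached only through its parent, every title appears once, and the depth argument tells you which level of Topic you are currently looking at.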
