How to properly parse parent/child XML with Python

Question

I have an XML parsing issue that I have been working on for the last few days, and I just can't figure it out. I've used both Python's built-in ElementTree and the lxml library, but I get the same results. I would like to keep using ElementTree if I can, but if that library has limitations, lxml would do. Please see the following XML example. What I am trying to do is find each connection element and see which classes it contains. I expect each connection to contain at least one class; if it doesn't, I want to know. The problem is that my code returns ALL THE CLASSES in the document for each connection, instead of only the classes for that specific connection.

<test>
  <connections>
    <connection>
      <id>10</id>
      <classes>
        <class>
          <classname>DVD</classname>
        </class>
        <class>
          <classname>DVD_TEST</classname>
        </class>
      </classes>
    </connection>
    <connection>
      <id>20</id>
      <classes>
        <class>
          <classname>TV</classname>
        </class>
      </classes>
    </connection>
  </connections>
</test>

For example, here is my Python code and the output that it returns:

for parentConnection in elemetTree.getiterator('connection'):
    # print parentConnection.tag
    for childConnection in parentConnection:
        # print childConnection.text
        if childConnection.tag == 'id':
            connID = childConnection.text
            print connID
    for p in tree.xpath('./connections/connection/classes/class'):
        for attrib in p.attrib:
            print '@' + attrib + '=' + p.attrib[attrib]

        children = p.getchildren()
        for child in children:
            print child.text

Here is the output:

10
DVD
DVD_TEST
TV

20
DVD
DVD_TEST
TV

As you can see, I am printing the text of the CONNECTION ID and then the text of each CLASSNAME. However, both connections show the same CLASSNAME text. The output should really look like this:

10
DVD
DVD_TEST

20
TV

As the hand-modified example above shows, each connection ID (parent) should have only its own classes/classnames (children). I just can't figure out how to make this work. If any of you know how to do this, I would love to hear it.

I've tried building a data structure and following other examples on this forum, but I just can't get it to work right.

Recommended answer

Here is my solution, which does not use XPath. I recommend digging a little further into the lxml documentation. There might be more elegant and direct ways to achieve this. There's a lot to explore!

Solution:

from lxml import etree


class FindClasses(object):
    @staticmethod
    def parse_xml(xml_bytes):
        # Parse the raw XML bytes and return the root element.
        return etree.fromstring(xml_bytes)

    def find(self, xml_bytes):
        # Visit every <connection>, print its <id>, then its own class names.
        for connection in self.parse_xml(xml_bytes).iter('connection'):
            for child in connection:
                if child.tag == 'id':
                    print(child.text)
                    self.find_classes(child)

    @staticmethod
    def find_classes(id_element):
        connection = id_element.getparent()  # traverse up: <id> -> <connection>
        for child in connection:  # children of <connection>: <id>, <classes>
            for class_element in child:  # children of <classes>: <class>
                for classname in class_element:  # children of <class>: <classname>
                    print(classname.text)
        print()


if __name__ == '__main__':
    # foo.xml or the path to your XML file
    with open('foo.xml', 'rb') as xml_file:
        xml = xml_file.read()
    FindClasses().find(xml)

Output:

10
DVD
DVD_TEST

20
TV
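
Since the question prefers ElementTree, here is a minimal sketch of the same idea using only the standard library. It assumes the code in the question returned every class because its path was evaluated from the whole tree on each pass, and it swaps in a path relative to each connection element instead. The function name classes_by_connection and the third connection (id 30, with no classes) are hypothetical additions for illustration, included to show how a connection without any class could be detected.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, plus a hypothetical connection (id 30)
# that has no classes, added only to demonstrate the "missing class" case.
XML = """<test>
  <connections>
    <connection>
      <id>10</id>
      <classes>
        <class><classname>DVD</classname></class>
        <class><classname>DVD_TEST</classname></class>
      </classes>
    </connection>
    <connection>
      <id>20</id>
      <classes>
        <class><classname>TV</classname></class>
      </classes>
    </connection>
    <connection>
      <id>30</id>
    </connection>
  </connections>
</test>"""


def classes_by_connection(xml_text):
    """Map each connection id to the class names found inside that connection."""
    root = ET.fromstring(xml_text)
    result = {}
    for connection in root.iter('connection'):
        conn_id = connection.findtext('id')
        # The path is relative to this <connection>, so other connections
        # are never searched.
        names = [el.text for el in connection.findall('classes/class/classname')]
        result[conn_id] = names
    return result


if __name__ == '__main__':
    for conn_id, names in classes_by_connection(XML).items():
        print(conn_id)
        if names:
            for name in names:
                print(name)
        else:
            print('(no classes)')
        print()
```

An empty list for a connection id means that connection has no class. With lxml, the same relative lookup should work by calling connection.xpath('./classes/class/classname') on each connection element rather than on the tree.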
