Python read xml with related child elements

Problem description

I have an XML file with this structure:

<?DOMParser ?> 
<logbook:LogBook xmlns:logbook="http://www/logbook/1.0"  version="1.2">
<product>
    <serialNumber value="764000606"/>
</product>
<visits>
<visit>
    <general>
        <startDateTime>2014-01-10T12:22:39.166Z</startDateTime>
        <endDateTime>2014-03-11T13:51:31.480Z</endDateTime>
    </general>
    <parts>
        <part number="03081" name="WSSA" index="0016"/>
    </parts>
</visit>
<visit>
<general>
    <startDateTime>2013-01-10T12:22:39.166Z</startDateTime>
    <endDateTime>2013-03-11T13:51:31.480Z</endDateTime>
</general>
<parts>
    <part number="02081" name="PSSF" index="0017"/>
</parts>
</visit>
</visits>
</logbook:LogBook>

I want two outputs from this XML:

1- Visits including the serial number, so I wrote:

import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse(filename)
root=tree.getroot()
visits=pd.DataFrame()
for general in root.iter('general'):
    for child in root.iter('serialNumber'):
        visits=visits.append({'startDateTime':general.find('startDateTime').text ,
                  'endDateTime': general.find('endDateTime').text, 'serialNumber':child.attrib['value'] }, ignore_index=True)

The output of this code is the following dataframe:

serialNumber | startDateTime          | endDateTime            |
-------------|------------------------|------------------------|
 764000606   |2014-01-10T12:22:39.166Z|2014-03-11T13:51:31.480Z|
 764000606   |2013-01-10T12:22:39.166Z|2013-03-11T13:51:31.480Z|

2- Parts

For parts, I want the following output: visits are distinguished from each other by startDateTime, and each part is shown with the visit it belongs to:

 serialNumber | startDateTime|number|name|index|
 -------------|--------------|------|----|-----|

What I wrote for parts:

parts=pd.DataFrame()
for part in root.iter('part'):
    for child in root.iter('serialNumber'):
            parts=parts.append({'index':part.attrib['index'],
                        'znumber':part.attrib['number'],
                        'name': part.attrib['name'], 'serialNumber':child.attrib['value'], 'startDateTime':general.find('startDateTime').text}, ignore_index=True)

This is what I get from this code:

 index |name|serialNumber| startDateTime          |znumber|
 ------|----|------------|------------------------|-------|
 0016  |WSSA|  764000606 |2013-01-10T12:22:39.166Z| 03081 |
 0017  |PSSF|  764000606 |2013-01-10T12:22:39.166Z| 02081 |

While I want this (look at startDateTime):

 index |name|serialNumber| startDateTime          |znumber|
 ------|----|------------|------------------------|-------|
 0016  |WSSA|  764000606 |2014-01-10T12:22:39.166Z| 03081 |
 0017  |PSSF|  764000606 |2013-01-10T12:22:39.166Z| 02081 |

Any ideas?
I am using XML ElementTree.

Recommended answer

Here's an example that gets the data from the xml.

code.py

#!/usr/bin/env python3

import sys
import xml.etree.ElementTree as ET
from pprint import pprint as pp


file_name = "a.xml"


def get_product_sn(product_node):
    for product_node_child in list(product_node):
        if product_node_child.tag == "serialNumber":
            return product_node_child.attrib.get("value", None)
    return None


def get_parts_data(parts_node):
    ret = list()
    for parts_node_child in list(parts_node):
        attrs = parts_node_child.attrib
        ret.append({"number": attrs.get("number", None), "name": attrs.get("name", None), "index": attrs.get("index", None)})
    return ret


def get_visit_node_data(visit_node):
    ret = dict()
    for visit_node_child in list(visit_node):
        if visit_node_child.tag == "general":
            for general_node_child in list(visit_node_child):
                if general_node_child.tag == "startDateTime":
                    ret["startDateTime"] = general_node_child.text
                elif general_node_child.tag == "endDateTime":
                    ret["endDateTime"] = general_node_child.text
        elif visit_node_child.tag == "parts":
            ret["parts"] = get_parts_data(visit_node_child)
    return ret


def get_node_data(node):
    ret = {"visits": list()}
    for node_child in list(node):
        if node_child.tag == "product":
            ret["serialNumber"] = get_product_sn(node_child)
        elif node_child.tag == "visits":
            for visits_node_child in list(node_child):
                ret["visits"].append(get_visit_node_data(visits_node_child))
    return ret


def main():
    tree = ET.parse(file_name)
    root_node = tree.getroot()
    data = get_node_data(root_node)
    pp(data)


if __name__ == "__main__":
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    main()

Notes

  • It treats the xml in a tree-like manner, so it maps (if you will) onto the xml (if the xml structure changes, the code should be adapted as well)
  • It's designed to be general: get_node_data could be called on any node that has 2 children: product and visits. In our case it's the root node itself, but in the real world there could be a sequence of such nodes, each with the 2 children listed above
  • It's designed to be error-friendly, so if the xml is incomplete, it will get as much data as it can; I chose this (greedy) approach over one that simply throws an exception when it encounters an error
  • As I didn't work with pandas, instead of populating a DataFrame I simply return a Python dictionary (json); I think converting it to a DataFrame shouldn't be hard
  • I've run it with Python 2.7 and Python 3.5
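The conversion mentioned in the notes can be sketched as follows. This is only an illustration, not part of the original answer: the helper names visit_rows and part_rows are made up here, and the data literal is copied from the dictionary that code.py prints for the question's xml. Each part row takes startDateTime from the visit that contains it, which is the pairing the question asks for.

```python
# Sample input: the dictionary printed by code.py for the XML in the question.
data = {
    "serialNumber": "764000606",
    "visits": [
        {"startDateTime": "2014-01-10T12:22:39.166Z",
         "endDateTime": "2014-03-11T13:51:31.480Z",
         "parts": [{"number": "03081", "name": "WSSA", "index": "0016"}]},
        {"startDateTime": "2013-01-10T12:22:39.166Z",
         "endDateTime": "2013-03-11T13:51:31.480Z",
         "parts": [{"number": "02081", "name": "PSSF", "index": "0017"}]},
    ],
}


def visit_rows(data):
    """Hypothetical helper: one row per visit, with the product serial number."""
    return [
        {"serialNumber": data.get("serialNumber"),
         "startDateTime": visit.get("startDateTime"),
         "endDateTime": visit.get("endDateTime")}
        for visit in data.get("visits", [])
    ]


def part_rows(data):
    """Hypothetical helper: one row per part, tagged with its own visit's start."""
    rows = []
    for visit in data.get("visits", []):
        for part in visit.get("parts", []):
            row = {"serialNumber": data.get("serialNumber"),
                   "startDateTime": visit.get("startDateTime")}
            row.update(part)  # adds the number, name and index keys
            rows.append(row)
    return rows


visits = visit_rows(data)
parts = part_rows(data)
for row in parts:
    print(row)
```

Passing either list to pd.DataFrame(...) should give the two frames from the question; pandas is left out of the sketch so that it runs with the standard library alone.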

The output (a dictionary containing 2 keys) - indented for readability:

  • serialNumber - the serial number (obviously)
  • visits (since it's a dictionary, I had to place this data "under" a key) - a list of dictionaries, each containing data from a visit node

Output


(py_064_03.05.04_test0) e:\Work\Dev\StackOverflow\q045049761>"e:\Work\Dev\VEnvs\py_064_03.05.04_test0\Scripts\python.exe" code.py
Python 3.5.4 (v3.5.4:3f56838, Aug  8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)] on win32

{'serialNumber': '764000606',
 'visits': [{'endDateTime': '2014-03-11T13:51:31.480Z',
             'parts': [{'index': '0016', 'name': 'WSSA', 'number': '03081'}],
             'startDateTime': '2014-01-10T12:22:39.166Z'},
            {'endDateTime': '2013-03-11T13:51:31.480Z',
             'parts': [{'index': '0017', 'name': 'PSSF', 'number': '02081'}],
             'startDateTime': '2013-01-10T12:22:39.166Z'}]}



@EDIT0: added handling of multiple part nodes, as requested in one of the comments. That functionality has been moved to get_parts_data. Now each entry in the visits list has a parts key whose value is a list of dictionaries, one extracted from each part node (the provided xml has only one part per visit).
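For the multiple-part case, a shorter alternative is to walk the visit nodes directly with ElementTree and search inside each visit only. The sketch below is an assumption-laden illustration rather than the answer's code: the second part in the first visit (03082 / WSSB / 0018) is invented to show a visit with more than one part, endDateTime is omitted for brevity, and the name rows is arbitrary.

```python
import xml.etree.ElementTree as ET

# Hypothetical variant of the question's XML: the first visit has two parts.
XML_TEXT = """<logbook:LogBook xmlns:logbook="http://www/logbook/1.0" version="1.2">
<product>
    <serialNumber value="764000606"/>
</product>
<visits>
<visit>
    <general>
        <startDateTime>2014-01-10T12:22:39.166Z</startDateTime>
    </general>
    <parts>
        <part number="03081" name="WSSA" index="0016"/>
        <part number="03082" name="WSSB" index="0018"/>
    </parts>
</visit>
<visit>
    <general>
        <startDateTime>2013-01-10T12:22:39.166Z</startDateTime>
    </general>
    <parts>
        <part number="02081" name="PSSF" index="0017"/>
    </parts>
</visit>
</visits>
</logbook:LogBook>"""

root = ET.fromstring(XML_TEXT)

# The serial number belongs to the product, so read it once.
serial_node = root.find("product/serialNumber")
serial = serial_node.get("value") if serial_node is not None else None

rows = []
for visit in root.iter("visit"):
    # Look up the date inside this visit only, so it matches this visit's parts.
    start = visit.findtext("general/startDateTime")
    for part in visit.findall("parts/part"):
        rows.append({
            "serialNumber": serial,
            "startDateTime": start,
            "number": part.get("number"),
            "name": part.get("name"),
            "index": part.get("index"),
        })

for row in rows:
    print(row)
```

The code in the question reused the general variable left over from the earlier loop, so every part received the last visit's date; scoping the lookup to the current visit node avoids that.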
