DATEXII XML file to DataFrame in Python


Problem description

For the last couple of days I have been trying to open and read a certain XML file (in DATEXII format), without success so far. The file contains traffic data from the NDW Open Data website (Dutch Databank for Road and Traffic Data), which is the source of the XML files. The head of the tree is shown in the snippet below, which forms only a very small part of the data.

<?xml version="1.0"?>
<soapenv:Envelope xmlns:_0="http://datex2.eu/schema/2/2_0" xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Header/>
  <soapenv:Body>
    <d2LogicalModel modelBaseVersion="2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <exchange xmlns="http://datex2.eu/schema/2/2_0">
        <supplierIdentification>
          <country>nl</country>
          <nationalIdentifier>NLNDW</nationalIdentifier>
        </supplierIdentification>
      </exchange>
      <payloadPublication lang="nl" xmlns="http://datex2.eu/schema/2/2_0" xsi:type="MeasuredDataPublication">
        <publicationTime>2017-10-30T05:00:40.007Z</publicationTime>
        <publicationCreator>
          <country>nl</country>
          <nationalIdentifier>NLNDW</nationalIdentifier>
        </publicationCreator>
        <measurementSiteTableReference targetClass="MeasurementSiteTable" version="955" id="NDW01_MT" />
        <headerInformation>
          <confidentiality>noRestriction</confidentiality>
          <informationStatus>real</informationStatus>
        </headerInformation>
        <siteMeasurements>
          <measurementSiteReference targetClass="MeasurementSiteRecord" version="1" id="PZH01_MST_0690_00" />
          <measurementTimeDefault>2017-10-30T04:59:00Z</measurementTimeDefault>
          <measuredValue index="1">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>60</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="2">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>0</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="3">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>0</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="4">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>60</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="5">
            <measuredValue>
              <basicData xsi:type="TrafficSpeed">
                <averageVehicleSpeed numberOfInputValuesUsed="1">
                  <speed>38</speed>
                </averageVehicleSpeed>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="6">
            <measuredValue>
              <basicData xsi:type="TrafficSpeed">
                <averageVehicleSpeed numberOfInputValuesUsed="0">
                  <speed>-1</speed>
                </averageVehicleSpeed>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="7">

Ideally I would like to load the data with Python in a Jupyter Notebook as a DataFrame, so that I can perform some predictive analytics if the data allows. I have tried ElementTree and lxml like this, inspired by numerous other threads:

# Standard Packages
import pandas as pd
import numpy as np

# Necessary Packages for XML and setting Working Directory
import os
import xml.etree.ElementTree as ET
import lxml

os.chdir("C:/.../Intensiteiten en snelheden/30-10-2017")

xml_file = open('0600_Trafficspeed.xml').read() # Unzipped the file manually

def xml2df(xml_data):
    root = ET.XML(xml_data) # element tree
    all_records = [] # This is our record list, which we will convert into a dataframe
    for i, child in enumerate(root): # Begin looping through our root tree
        record = {} # Placeholder for our record
        for subchild in child: # Iterate through the subchildren
            record[subchild.tag] = subchild.text # Extract the text, create a new dictionary key, value pair
        all_records.append(record) # Append this record to all_records.
    return pd.DataFrame(all_records) # Return records as DataFrame

print(xml2df(xml_file))

However, this returns only a single entry built from the first level, e.g. column name: d2LogicalModel, row: 0, entry: None.

With difficulty, I was able to view the tree structure in Microsoft Edge, which required a lot of CPU (Notepad++ with the XML Tools plugin also worked, but it crashes on "bigger" files, i.e. > 20 MB). Even so, I found the structure difficult to comprehend. There are so many layers that I do not know how to define xml2df() with the correct sub-subchildren, etc.

My question thus boils down to two parts. First, how can I identify the variables/columns that hold data, so that I get an overview of the relevant data I want to import? Second, how do I import this into a DataFrame?
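For the first part, one possible way to get an overview without opening the file in an editor is to stream through it and count every element path, noting which paths actually carry text. The following is a minimal sketch under stated assumptions, not part of the original post: the helper names `local` and `path_census` are hypothetical, and the embedded `sample` is a shortened stand-in for the real file.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit('}', 1)[-1]


def path_census(source):
    """Count each element path and how often that path carries text."""
    paths, with_text, stack = Counter(), Counter(), []
    for event, elem in ET.iterparse(source, events=('start', 'end')):
        if event == 'start':
            stack.append(local(elem.tag))
            paths['/'.join(stack)] += 1
        else:
            # On 'end' the element is complete, so its text is available.
            if elem.text and elem.text.strip():
                with_text['/'.join(stack)] += 1
            stack.pop()
            elem.clear()  # release children and text to keep memory low
    return paths, with_text


# Shortened stand-in for the real DATEXII file (assumption for illustration).
sample = b"""<?xml version="1.0"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <d2LogicalModel>
      <payloadPublication xmlns="http://datex2.eu/schema/2/2_0">
        <publicationTime>2017-10-30T05:00:40.007Z</publicationTime>
        <siteMeasurements>
          <measuredValue index="1"><measuredValue><basicData>
            <vehicleFlow><vehicleFlowRate>60</vehicleFlowRate></vehicleFlow>
          </basicData></measuredValue></measuredValue>
          <measuredValue index="2"><measuredValue><basicData>
            <vehicleFlow><vehicleFlowRate>0</vehicleFlowRate></vehicleFlow>
          </basicData></measuredValue></measuredValue>
        </siteMeasurements>
      </payloadPublication>
    </d2LogicalModel>
  </soapenv:Body>
</soapenv:Envelope>"""

paths, with_text = path_census(io.BytesIO(sample))
for path, count in paths.items():
    # occurrences, occurrences with text, path
    print(count, with_text[path], path)
```

Paths whose text count is greater than zero are the candidates for DataFrame columns. For a real file, a filename can be passed to `path_census` instead of the `io.BytesIO` object. Attributes such as `index` or `id` are not counted in this sketch and would need a similar tally.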

Note: Since DATEXII is the standard format for traffic data in Europe, I was hoping its guides would help (see documents), but they haven't made sense to me yet. Maybe they will to one of you :)

Thank you very much for your help!

Recommended answer

Consider transforming your nested XML input into a flatter structure using XSLT, the special-purpose language designed to transform XML files into other XML, HTML, or even text (CSV/TAB). The XSLT below transforms the original XML into comma-separated values in tabular format for import into pandas with read_csv():

XSLT (save as an .xsl file, which is itself a special XML file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                              xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
                              xmlns:pub="http://datex2.eu/schema/2/2_0"
                              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/soapenv:Envelope">
    <xsl:text>publicationTime,country,nationalIdentifier,msmtSiteTableRef_targetClass,msmtSiteTableRef_version,msmtSiteTableRef_id,</xsl:text>
    <xsl:text>msmtSiteRef_targetClass,msmtSiteRef_version,msmtSiteRef_id,measurementTimeDefault,</xsl:text>
    <xsl:text>measuredValue_index,basicData_type,vehicleFlowRate,averageVehicleSpeed_numberOfInputValues,averageVehicleSpeed_value</xsl:text>
    <xsl:text>&#xa;</xsl:text>
    <xsl:apply-templates select="soapenv:Body"/>
  </xsl:template>

  <xsl:template match="soapenv:Body">
    <xsl:apply-templates select="d2LogicalModel"/>
  </xsl:template>

  <xsl:template match="d2LogicalModel">
    <xsl:apply-templates select="pub:payloadPublication"/>
  </xsl:template>

  <xsl:template match="pub:payloadPublication">
    <xsl:apply-templates select="pub:siteMeasurements"/>
  </xsl:template>

  <xsl:template match="pub:siteMeasurements">
    <xsl:apply-templates select="pub:measuredValue"/>
  </xsl:template>

  <xsl:template match="pub:measuredValue">
    <xsl:value-of select="concat(ancestor::pub:payloadPublication/pub:publicationTime,',',
                                 ancestor::pub:payloadPublication/pub:publicationCreator/pub:country,',',
                                 ancestor::pub:payloadPublication/pub:publicationCreator/pub:nationalIdentifier,',',
                                 ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@targetClass,',',
                                 ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@version,',',
                                 ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@id,',',
                                 ancestor::pub:siteMeasurements/pub:measurementSiteReference/@targetClass,',',
                                 ancestor::pub:siteMeasurements/pub:measurementSiteReference/@version,',',
                                 ancestor::pub:siteMeasurements/pub:measurementSiteReference/@id,',',
                                 ancestor::pub:siteMeasurements/pub:measurementTimeDefault,',',
                                 @index,',',
                                 pub:measuredValue/pub:basicData/@xsi:type,',',
                                 descendant::pub:vehicleFlowRate,',',
                                 descendant::pub:averageVehicleSpeed/@numberOfInputValuesUsed,',',
                                 descendant::pub:speed)"/><xsl:text>&#xa;</xsl:text>    
  </xsl:template>

</xsl:stylesheet>

Python

from io import StringIO
import lxml.etree as et
import pandas as pd

# LOAD XML AND XSL FILES
doc = et.parse('/path/to/Input.xml')
xsl = et.parse('/path/to/XSLT.xsl')

# INITIALIZE AND RUN TRANSFORMATION
transform = et.XSLT(xsl)
# CONVERT RESULT TO STRING 
result = str(transform(doc))

# IMPORT INTO DATAFRAME
df = pd.read_csv(StringIO(result))

Output (parent node values repeat on every row alongside the differing numeric data)

print(df)

#             publicationTime country nationalIdentifier msmtSiteTableRef_targetClass  msmtSiteTableRef_version msmtSiteTableRef_id msmtSiteRef_targetClass  msmtSiteRef_version     msmtSiteRef_id measurementTimeDefault  measuredValue_index basicData_type  vehicleFlowRate  averageVehicleSpeed_numberOfInputValues  averageVehicleSpeed_value
# 0  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    1    TrafficFlow             60.0                                      NaN                        NaN
# 1  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    2    TrafficFlow              0.0                                      NaN                        NaN
# 2  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    3    TrafficFlow              0.0                                      NaN                        NaN
# 3  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    4    TrafficFlow             60.0                                      NaN                        NaN
# 4  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    5   TrafficSpeed              NaN                                      1.0                       38.0
# 5  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    6   TrafficSpeed              NaN                                      0.0                       -1.0
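
If keeping a separate stylesheet is not convenient, a similar flattening can be sketched with the standard library alone. This is an alternative under stated assumptions, not the method of the answer above: it assumes the element layout of the snippet in the question, the generator name `iter_rows` and the shortened `sample` are hypothetical, and only a subset of the columns is produced. It streams with `iterparse` and clears each `siteMeasurements` block once processed, which may help with the larger files mentioned in the question.

```python
import io
import xml.etree.ElementTree as ET

# Namespaces as they appear in the snippet from the question.
NS = '{http://datex2.eu/schema/2/2_0}'
XSI_TYPE = '{http://www.w3.org/2001/XMLSchema-instance}type'


def iter_rows(source):
    """Yield one dict per indexed measuredValue, streaming site by site."""
    publication_time = None
    for _, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == NS + 'publicationTime':
            publication_time = elem.text
        elif elem.tag == NS + 'siteMeasurements':
            site = elem.find(NS + 'measurementSiteReference')
            # findall() with a plain tag only returns direct children,
            # i.e. the outer measuredValue elements that carry @index.
            for mv in elem.findall(NS + 'measuredValue'):
                basic = mv.find(f'{NS}measuredValue/{NS}basicData')
                speed = mv.find(f'.//{NS}averageVehicleSpeed')
                yield {
                    'publicationTime': publication_time,
                    'msmtSiteRef_id': site.get('id') if site is not None else None,
                    'measurementTimeDefault': elem.findtext(NS + 'measurementTimeDefault'),
                    'measuredValue_index': mv.get('index'),
                    'basicData_type': basic.get(XSI_TYPE) if basic is not None else None,
                    'vehicleFlowRate': mv.findtext(f'.//{NS}vehicleFlowRate'),
                    'numberOfInputValuesUsed': (
                        speed.get('numberOfInputValuesUsed') if speed is not None else None
                    ),
                    'speed': mv.findtext(f'.//{NS}speed'),
                }
            elem.clear()  # this site is done; free its subtree


# Shortened stand-in for the real DATEXII file (assumption for illustration).
sample = b"""<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <d2LogicalModel xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <payloadPublication xmlns="http://datex2.eu/schema/2/2_0">
        <publicationTime>2017-10-30T05:00:40.007Z</publicationTime>
        <siteMeasurements>
          <measurementSiteReference id="PZH01_MST_0690_00"/>
          <measurementTimeDefault>2017-10-30T04:59:00Z</measurementTimeDefault>
          <measuredValue index="1"><measuredValue><basicData xsi:type="TrafficFlow">
            <vehicleFlow><vehicleFlowRate>60</vehicleFlowRate></vehicleFlow>
          </basicData></measuredValue></measuredValue>
          <measuredValue index="5"><measuredValue><basicData xsi:type="TrafficSpeed">
            <averageVehicleSpeed numberOfInputValuesUsed="1"><speed>38</speed></averageVehicleSpeed>
          </basicData></measuredValue></measuredValue>
        </siteMeasurements>
      </payloadPublication>
    </d2LogicalModel>
  </soapenv:Body>
</soapenv:Envelope>"""

rows = list(iter_rows(io.BytesIO(sample)))
for row in rows:
    print(row)
```

The rows can then be handed to pandas with `pd.DataFrame(rows)`. All values arrive as strings (or `None` where an element is absent), so numeric columns would still need a conversion such as `pd.to_numeric`.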
