使用Python将数据从XML树提取到pandas/csv中 [英] Extracting data from XML tree into pandas/csv with Python

查看:156
本文介绍了使用Python将数据从XML树提取到pandas/csv中的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我对某些XML文件有问题. 关于数据,我不能说太多,因为它是用于工作的,我不想惹麻烦!从一个巨大的XML文件(123091行代码)中,我只需要7个标签中的数据(如果有这样的话). 我正在尝试提取特定数据,但是在尝试存储到熊猫或csv中时遇到了一些情况. 我找到了一种获取一些信息的方法,例如:

I have an issue with some XML files. I cannot say a lot about data, because it is for work and I don't want to be in trouble! From a huge XML file, 123091 lines of code, I only need data from 7 tags(if that makes sense). I am trying to extract that specific data, but I am having a bit of a situation when trying to store into pandas or csv. I have found a method to take some information out, like:

for info in root.iter('ArtistName'):
   print(info.text)

上面的代码将为我提供来自该XML标签的数据中的艺术家.这是我的Jupyter Notebook的一小部分,带有上述代码行的输出:

The code above will give me the artists in the data from that XML tag. Here is a little part of my Jupyter Notebook, with the output of the above lines of code:

Various Artists
Various Artists
Various Artists
Various Artists
Various Artists
Cream
Various Artists
Various Artists
Various Artists
Various Artists
Various Artists
Fleetwood Mac
Fleetwood Mac
Linkin Park
Lynyrd Skynyrd
Fleetwood Mac
Eric Clapton
The Black Keys
Tegan And Sara

然后,我遇到了问题,因为在下面的代码中,我不能说得更好,或者说得更好,我不知道如何遍历XML中的每个标记以提取数据.以下是尝试:

And then, I have run into the problem, because in the below code, I cannot or better said, I don't know how to loop over each tag from XML to extract the data. Below is an attempt:

import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse("filename.xml")
root = tree.getroot()
dfcols = ['IRC', 'IRC2', 'Artist', 'Song', 'Units', 'PPD', 'TerritoryCode']
df_xml = pd.DataFrame(columns = dfcols)

for i in root.iter(tree):
   df_xml = df_xml.append(pd.Series(index=dfcols), ignore_index=True)

df_xml.head()

上面的代码的结果是:

 IRC IRC2 Artist Song Units PPD TerritoryCode

哪个是我要创建的文件的标题. 我找不到将所需信息带入这些列的方法.

Which is the header of the file that I want to create. I cannot find a way to bring the information I need into these columns.

我也尝试过:

def getValOfNode(node):
    return node.text if node is not None else None


def main():

    dfcols = ['IRC', 'IRC2', 'Artist', 'Song', 'Units', 'PPD', 'TerritoryCode']
    df_xml = pd.DataFrame(columns = dfcols)

    for node in tree:
        IRC = node.find('IRC')
        IRC2 = node.find('ICPN')
        Artist = node.find('rtistName')
        Song = node.find('Title')
        Units = node.find('ConsumerSales')
        PPD = node.find('Amount')
        TerritoryCode = node.find('TerritoryCode')

        df_xml = df_xml.append(
            pd.Series([getValOfNode(IRC), getValOfNode(IRC2), getValOfNode(Artist), getValOfNode(Song), getValOfNode(Units), getValOfNode(PPD), getValOfNode(TerritoryCode)], index=dfcols), ignore_index=True)

    print(df_xml)


main()

我收到此错误:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-1f816143f9e4> in <module>()
     23 
     24 
---> 25 main()

<ipython-input-5-1f816143f9e4> in main()
      8     df_xml = pd.DataFrame(columns = dfcols)
      9 
---> 10     for node in tree:
     11         IRC = node.find('IRC')
     12         IRC2 = node.find('ICPN')

TypeError: 'ElementTree' object is not iterable

运行时,地区代码也存在问题:

There is also an issue with the territory code, when I run:

for info in root.iter('TerritoryCode'):
   print(info.text)

它打印出这些区域,但是由于它们是重复的(我不知道如何解释),所以我真的需要所有这些,而不仅仅是每个.如果那是有道理的. 这就是我得到的:

it prints the territories but, in order, because they are duplicates(I don't know how to explain), I really need all of them and not just one of each. If that makes sense. This is what I get:

AE
AR
AT
AU
AW
BE
BG
BO
BR
BY
CA
CH
CL
CN
CO
CR
CY
CZ
DE
DK
DO
DZ
EC
EE
EG
ES
FI
FR
GB
GL
GR
GT
HK
HN

这就是我需要的:

AD
AD
AE
AE
AE
AE
AE
AE,

,依此类推.

有人可以帮我吗? 非常感激.

Can anyone help me with this? Much appreciated.

祝你有美好的一天:)

Have a great day :)

推荐答案

如上所述,您所需的节点位于XML的不同级别,因此每个数据项的路径表达式都不同.另外,您需要遍历两个重复级别: SalesToRecordCompanyByTerritory ReleaseTransactionsToRecordCompany .

As mentioned, your needed nodes are at different levels of the XML and hence path expressions will be different for each data item. Additionally you need to traverse between two repeating levels: SalesToRecordCompanyByTerritory and ReleaseTransactionsToRecordCompany.

因此,请考虑在嵌套的for循环中进行解析.与其在循环内增加数据框架,不如构建一个字典列表,您可以将其传递给循环外熊猫的DataFrame()构造函数.通过这种方法,您可以将字典键迁移为列,将元素迁移为数据.

Therefore, consider parsing in nested for loops. And rather than growing a data frame inside a loop, build a list of dictionaries that you can pass into pandas' DataFrame() constructor outside of the loop. With this approach, you migrate dictionary keys as columns and elements as data.

以下使用链式find()调用,较长的相对路径或较短的绝对路径来导航嵌套级别并检索相应的元素文本值.请注意,所有解析都是相对于具有父对象terr和子对象rls的循环节点的.

Below uses chained find() calls, long relative, or short absolute paths to navigate down the nested levels and retrieve corresponding element text values. Notice all parsing are relative to looped nodes with parent terr and child rls objects.

import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse("file.xml")

data = []
for terr in tree.findall('.//SalesToRecordCompanyByTerritory'):

    for rls in terr.findall('.//ReleaseTransactionsToRecordCompany'):

        inner = {}

        # DESCENDANTS
        inner['IRC'] = rls.find('./ReleaseId/ISRC').text    
        inner['IRC2'] = rls.find('./ReleaseId/ICPN').text

        # CHILDREN
        inner['Artist'] = rls.find('WMGArtistName').text
        inner['Song'] = rls.find('WMGTitle').text

        # DESCENDANTS
        inner['Units'] = rls.find('./SalesTransactionToRecordCompany/SalesDataToRecordCompany/GrossNumberOfConsumerSales').text    
        inner['PPD'] = rls.find('Deal').find('AmountPayableInCurrencyOfAccounting').text

        # PARENT
        inner['TerritoryCode'] = terr.find('./TerritoryCode').text

        data.append(inner)

df = pd.DataFrame(data)

您可以使用.//缩短带有绝对路径的find()链和较长的相对路径:

You can shorten the find() chains and long relative paths with absolute paths using .//:

inner['IRC'] = rls.find('.//ISRC').text    
inner['IRC2'] = rls.find('.//ICPN').text

inner['PPD'] = rls.find('.//AmountPayableInCurrencyOfAccounting').text
inner['Units'] = rls.find('.//GrossNumberOfConsumerSales').text  

这篇关于使用Python将数据从XML树提取到pandas/csv中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆