How to open this XML file to create a DataFrame in Python?


Question

Does anyone have a suggestion for the best way to open the XML data on the site below and load it into a DataFrame in Python (I prefer working with pandas)? The file is behind the "Data - XML (sdmx/zip)" link on this page:

http://www.federalreserve.gov/pubs/feds/2006/200628/200628abs.html

I've tried the following, adapted from http://timhomelab.blogspot.com/2014/01/how-to-read-xml-file-into-dataframe.html, and it seems I'm getting close:

from lxml import objectify
import pandas as pd

path = 'feds200628.xml'
xml = objectify.parse(open(path))
root = xml.getroot()
root.getchildren()[0].getchildren()
df = pd.DataFrame(columns=('id', 'name'))

for i in range(0,4):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['id', 'name'], [obj[0].text, obj[1].text]))
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)

Still, I don't know enough about XML to get the rest of the way.

Any help would be awesome. I don't even need the result to be a DataFrame; I just need to figure out how to parse this content in Python somehow.

Recommended answer

XML is a tree-like structure, while a pandas DataFrame is a 2D table-like structure, so there is no automatic way to convert between the two. You have to understand the XML structure and decide how to map its data onto a 2D table. Thus, every XML-to-DataFrame problem is different.
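
One way to get a feel for an unfamiliar XML file before deciding on a mapping is to count the tags and list the attribute names each one carries. The sketch below does this with the standard library on a tiny made-up document; the library/shelf/book names and the summarize helper are invented for illustration and are not taken from the Federal Reserve file:

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Hypothetical miniature document; element and attribute names are invented.
SAMPLE = """<library>
  <shelf label="A">
    <book year="1999" title="First"/>
    <book year="2005" title="Second"/>
  </shelf>
  <shelf label="B">
    <book year="2010" title="Third"/>
  </shelf>
</library>"""


def summarize(root):
    """Count each tag and collect the attribute names seen on it."""
    tag_counts = Counter()
    attr_names = defaultdict(set)
    for elem in root.iter():  # walks the root and all of its descendants
        tag_counts[elem.tag] += 1
        attr_names[elem.tag].update(elem.attrib)  # adds the attribute keys
    return tag_counts, attr_names


root = ET.fromstring(SAMPLE)
tag_counts, attr_names = summarize(root)
for tag, count in tag_counts.items():
    print(tag, count, sorted(attr_names[tag]))
```

A summary like this suggests which repeating element could become a row and which attributes could become columns. For a very large file, the same counting could be done on a truncated copy or with a streaming parser instead of loading the whole tree.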

Your XML has 2 DataSet elements, each containing a number of Series elements. Each Series contains a number of Obs elements.

Each Series has a SERIES_NAME attribute, and each Obs has OBS_STATUS, TIME_PERIOD and OBS_VALUE attributes. So it seems reasonable to create a table with NAME (taken from SERIES_NAME), OBS_STATUS, TIME_PERIOD and OBS_VALUE columns.
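
To make that mapping concrete, here is a minimal non-streaming sketch on a small hand-written sample. The "common" namespace URI is the one that appears in the code further down; the "kf" URI, the frb prefix, the root wrapper element and all the values are assumptions made up for illustration, so the real file will differ:

```python
import xml.etree.ElementTree as ET

# The COMMON URI is copied from the answer's code. The KF URI is hypothetical,
# as are the <root> wrapper, the 'frb' prefix and every value in the sample.
COMMON = 'http://www.federalreserve.gov/structure/compact/common'
KF = 'http://example.com/kf'

SAMPLE = f"""<root xmlns:frb="{COMMON}" xmlns:kf="{KF}">
  <frb:DataSet>
    <kf:Series SERIES_NAME="DEMO1">
      <frb:Obs OBS_STATUS="A" TIME_PERIOD="2000-01-03" OBS_VALUE="1.5"/>
      <frb:Obs OBS_STATUS="A" TIME_PERIOD="2000-01-04" OBS_VALUE="1.6"/>
    </kf:Series>
    <kf:Series SERIES_NAME="DEMO2">
      <frb:Obs OBS_STATUS="A" TIME_PERIOD="2000-01-03" OBS_VALUE="2.5"/>
    </kf:Series>
  </frb:DataSet>
</root>"""

obs_keys = ['OBS_STATUS', 'TIME_PERIOD', 'OBS_VALUE']
columns = ['NAME'] + obs_keys

root = ET.fromstring(SAMPLE)
rows = []
# One output row per Obs, tagged with the name of the Series that contains it.
for series in root.iter(f'{{{KF}}}Series'):
    name = series.get('SERIES_NAME')
    for obs in series.iter(f'{{{COMMON}}}Obs'):
        rows.append([name] + [obs.get(key) for key in obs_keys])

for row in rows:
    print(row)
```

Passing rows and columns to pd.DataFrame(rows, columns=columns) would give the same four-column layout. This version holds the whole tree in memory, which may be fine for small files but is likely to be costly for one with around a million observations.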

I found pulling the desired data out of the XML a bit complicated, which makes me doubt that I've found the best way to do it. But here is one way (P.S. Thomas Maloney's idea of starting with the 2D table-like XLS data should be much simpler):

import lxml.etree as ET
import pandas as pd

path = 'feds200628.xml'

def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    http://stackoverflow.com/a/7171543/190597 (unutbu)
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

data = list()
obs_keys = ['OBS_STATUS', 'TIME_PERIOD', 'OBS_VALUE']
columns = ['NAME'] + obs_keys

def process_obs(elem, name):
    dct = elem.attrib
    # print(dct)
    data.append([name] + [dct[key] for key in obs_keys])

def process_series(elem):
    dct = elem.attrib
    # print(dct)
    context = ET.iterwalk(
        elem, events=('end', ),
        tag='{http://www.federalreserve.gov/structure/compact/common}Obs'
        )
    fast_iter(context, process_obs, dct['SERIES_NAME'])

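def process_dataset(elem):
    # Look up the namespace URI bound to the 'kf' prefix so the Series tag
    # can be written in lxml's '{uri}tag' form.
    nsmap = elem.nsmap
    # print(nsmap)
    context = ET.iterwalk(
        elem, events=('end', ),
        tag='{{{ns}}}Series'.format(ns=nsmap['kf'])
        )
    fast_iter(context, process_series)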

with open(path, 'rb') as f:
    context = ET.iterparse(
        f, events=('end', ),
        tag='{http://www.federalreserve.gov/structure/compact/common}DataSet'
        )
    fast_iter(context, process_dataset)
    df = pd.DataFrame(data, columns=columns)

yields

            NAME OBS_STATUS TIME_PERIOD   OBS_VALUE
0        SVENY01          A  1961-06-14      2.9825
1        SVENY01          A  1961-06-15      2.9941
2        SVENY01          A  1961-06-16      3.0012
3        SVENY01          A  1961-06-19      2.9949
4        SVENY01          A  1961-06-20      2.9833
5        SVENY01          A  1961-06-21      2.9993
6        SVENY01          A  1961-06-22      2.9837
...
1029410     TAU2          A  2014-09-19  3.72896779
1029411     TAU2          A  2014-09-22  3.12836171
1029412     TAU2          A  2014-09-23  3.20146575
1029413     TAU2          A  2014-09-24  3.29972110
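
Because XML attribute values are text, the columns in a frame built this way hold strings. A possible follow-up step, sketched here on a few rows copied from the output above (the wide name and the choice to pivot are illustrative assumptions, not part of the original answer), is to convert the types and reshape to one column per series:

```python
import pandas as pd

# A few rows copied from the output above, kept as strings to mimic what
# parsing XML attributes produces.
df = pd.DataFrame(
    [['SVENY01', 'A', '1961-06-14', '2.9825'],
     ['SVENY01', 'A', '1961-06-15', '2.9941'],
     ['TAU2', 'A', '2014-09-19', '3.72896779'],
     ['TAU2', 'A', '2014-09-22', '3.12836171']],
    columns=['NAME', 'OBS_STATUS', 'TIME_PERIOD', 'OBS_VALUE'])

# Convert text to proper dtypes; errors='coerce' turns unparseable values
# into NaN instead of raising.
df['TIME_PERIOD'] = pd.to_datetime(df['TIME_PERIOD'])
df['OBS_VALUE'] = pd.to_numeric(df['OBS_VALUE'], errors='coerce')

# Reshape from long to wide: one row per date, one column per series name.
wide = df.pivot(index='TIME_PERIOD', columns='NAME', values='OBS_VALUE')
print(wide)
```

Note that pivot raises an error if the same (TIME_PERIOD, NAME) pair occurs more than once, so on the full data it may be worth checking for duplicates first.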
