Parsing XML to a pandas data frame throws a memory error

Problem description

I am trying to put 1. a parent attribute 2. a child attribute and 3. a grandchild text into a data frame. I am able to get the child attribute and the grandchild text to print out on the screen, but I cannot get them to go into a data frame. I get a memory error from pandas.

Here is the introductory setup:

import requests
from lxml import etree, objectify
r = requests.get('https://api.stuff.us/place/getData?security_key=key&period=minutes&startTime=2013-05-01T00:00&endTime=2013-05-01T23:59&sort=channel') #edited for privacy
root = etree.fromstring(r.text)
xml_new = etree.tostring(root, pretty_print=True)
print xml_new[300:900] #gives xml output to show structure
<startTime>2013-05-01 00:00:00</startTime>
<endTime>2013-05-01 23:59:00</endTime>
<summaryPeriod>minutes</summaryPeriod>
<data>
  <channel channel="97925" name="blah"> 
    <Time Time="2013-05-01 00:00:00">
      <value>258</value>
    </Time>
    <Time Time="2013-05-01 00:01:00">
      <value>259</value>
    </Time>
    <Time Time="2013-05-01 00:02:00">
      <value>258</value>
    </Time>
    <Time Time="2013-05-01 00:03:00">
      <value>257</value>
    </Time>

This shows how I am parsing to get the child attribute and grandchild to print.

for df in root.xpath('//channel/Time'):
    ## Iterate over attributes of channel/Time
    for attrib in df.attrib:
        print '@' + attrib + '=' + df.attrib[attrib]
    ## value is a child of Time, and iterate
    subfields = df.getchildren()
    for subfield in subfields:
        print subfield.tag + '=' + subfield.text

It yields a very long print out with the information as requested:

...
@Time=2013-05-01 23:01:00
value=100
@Time=2013-05-01 23:02:00
value=101
@Time=2013-05-01 23:03:00
value=99
@Time=2013-05-01 23:04:00
value=101
...

However, when I try to put it into a data frame, I get a memory error. I tried with both of them, and also with just the child attribute.

data = []
for df in root.xpath('//channel/Time'):
    ## Iterate over attributes of channel/Time
    for attrib in df.attrib:
        el_data = {}
        el_data[attrib] = df.attrib[attrib]
    data.append(el_data)
from pandas import *
perf = DataFrame(data)
perf

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-6-08c8c74f7192> in <module>()
      1 from pandas import *
----> 2 perf = DataFrame(data)
      3 perf

/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
    417 
    418                 if isinstance(data[0], (list, tuple, collections.Mapping, Series)):
--> 419                     arrays, columns = _to_arrays(data, columns, dtype=dtype)
    420                     columns = _ensure_index(columns)
    421 

/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/core/frame.pyc in _to_arrays(data, columns, coerce_float, dtype)
   5457         return _list_of_dict_to_arrays(data, columns,
   5458                                        coerce_float=coerce_float,
-> 5459                                        dtype=dtype)
   5460     elif isinstance(data[0], Series):
   5461         return _list_of_series_to_arrays(data, columns,

/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/core/frame.pyc in _list_of_dict_to_arrays(data, columns, coerce_float, dtype)
   5521             for d in data]
   5522 
-> 5523     content = list(lib.dicts_to_array(data, list(columns)).T)
   5524     return _convert_object_array(content, columns, dtype=dtype,
   5525                                  coerce_float=coerce_float)

/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.dicts_to_array (pandas/lib.c:7657)()

MemoryError: 

I have 12960 values of "value" in my xml file. I assume that these memory errors are telling me something about the values in the file not meeting what is expected, but that doesn't match with a memory error, and I could not figure it out from other SO questions regarding memory errors or from the pandas documentation.
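As a rough sanity check on whether 12960 rows could plausibly exhaust memory on their own, the sketch below estimates the in-memory size of a list shaped like the `data` list in the question. The `rows` list is a hypothetical stand-in built from made-up timestamps, and the byte counts vary by Python version and platform. It only puts a number on the data size and does not identify the cause of the error.

```python
import sys

# Hypothetical stand-in for the question's `data` list: 12960 one-key dicts,
# each mapping 'Time' to a timestamp string.
rows = [{'Time': '2013-05-01 00:%02d:00' % (i % 60)} for i in range(12960)]

# Shallow size of the list plus each dict and its string value, in bytes.
# sys.getsizeof does not follow references, so the parts are added by hand.
total = sys.getsizeof(rows)
for row in rows:
    total += sys.getsizeof(row)
    total += sum(sys.getsizeof(v) for v in row.values())

size_mb = total / (1024.0 * 1024.0)
print('approx. %.1f MB for %d rows' % (size_mb, len(rows)))
```

A result of a few megabytes would support the suspicion above that the amount of data alone does not account for a `MemoryError`.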

An attempt to get the data types yields no information. Maybe there are no types? Perhaps because they are elements in an element tree. (I tried to print .pyval, but it only told me there was no attribute.) el_data is of type "dict"

print(objectify.dump(root))[700:1000] #print a subset of types
name = 'zone'
            Time = None [_Element]
              * Time = '2013-05-01 00:00:00'
                value = '258' [_Element]
            Time = None [_Element]
              * Time = '2013-05-01 00:01:00'
                value = '259' [_Element]
type(el_data)
dict
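The dump above shows every value quoted and every node as a plain `_Element`, which suggests the attribute values and element text are ordinary strings. The `.pyval` attribute belongs to elements produced by `lxml.objectify` parsing, so a tree built with `etree.fromstring` would not be expected to have it. The sketch below illustrates the string types on a small fragment. It uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra packages, and the `snippet` text is a made-up sample with the same shape as the XML in the question.

```python
import xml.etree.ElementTree as ET

# Small fragment with the same shape as the XML in the question.
snippet = """
<data>
  <channel channel="97925" name="blah">
    <Time Time="2013-05-01 00:00:00">
      <value>258</value>
    </Time>
  </channel>
</data>
"""

root = ET.fromstring(snippet)
time_el = root.find('./channel/Time')
value_el = time_el.find('value')

# Attribute values and element text are both plain strings.
time_type = type(time_el.attrib['Time'])
text_type = type(value_el.text)
print(time_type, text_type)

# Converting to a number is an explicit step left to the caller.
value_as_int = int(value_el.text)
```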

I built this code based on the book Python for Data Analysis and other examples found on SO for parsing XML. I am still new to Python.

Running Python 2.7.2 on Mac OS 10.7.5

Recommended answer

Answer based on help from Jeff and JoeKington. The data needed to be put into lists separately before being pushed into the data frame. The memory error was being caused by the multiple "elements", which could not be put into a data frame as they were. Instead, the values need to be collected into one list per column, and those lists can then go into a data frame.

This works:

from pandas import DataFrame

dTime = []
dvalue = []
for df in root.xpath('//channel/Time'):
    ## Iterate over attributes of channel/Time
    for attrib in df.attrib:
        dTime.append(df.attrib[attrib])
    ## value is a child of Time, and iterate
    subfields = df.getchildren()
    for subfield in subfields:
        dvalue.append(subfield.text)
## One list per column; keys match the column names in the output below
pef = DataFrame({'Time': dTime, 'value': dvalue})

pef

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12960 entries, 0 to 12959
Data columns (total 2 columns):
Time     12960  non-null values
value    12960  non-null values
dtypes: object(2) 
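The `dtypes: object(2)` line indicates that both columns hold Python strings. If typed columns are wanted, one option is to convert the lists before building the frame. The sketch below does this with the standard library only. The sample `dTime` and `dvalue` contents are hypothetical stand-ins for the lists built by the loop, and the timestamp format is assumed from the XML shown in the question.

```python
from datetime import datetime

# Hypothetical sample of the two lists built by the loop above.
dTime = ['2013-05-01 00:00:00', '2013-05-01 00:01:00', '2013-05-01 00:02:00']
dvalue = ['258', '259', '258']

# Parse the timestamp strings with the format visible in the XML.
times = [datetime.strptime(t, '%Y-%m-%d %H:%M:%S') for t in dTime]

# The sample readings are whole numbers; use float() if decimals can occur.
values = [int(v) for v in dvalue]

# Same dict-of-lists shape as in the answer, now with typed values.
columns = {'Time': times, 'value': values}
```

Passing `columns` to `DataFrame(...)` in place of the string lists should then give typed columns rather than `object`. pandas also offers `to_datetime` for converting a column after the frame exists.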

pef[:5]

    Time                    value
0    2013-05-01 00:00:00    258
1    2013-05-01 00:01:00    259
2    2013-05-01 00:02:00    258
3    2013-05-01 00:03:00    257
4    2013-05-01 00:04:00    257
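The code above collects the `Time` attribute and the `value` text, but not the parent `channel` attribute that the question also asked for. A minimal sketch of one way to carry it along, following the same one-list-per-column pattern, is shown below. It uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra packages, and the `snippet` text and the `channels`, `times` and `values` names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Small fragment with the same shape as the XML in the question.
snippet = """
<data>
  <channel channel="97925" name="blah">
    <Time Time="2013-05-01 00:00:00"><value>258</value></Time>
    <Time Time="2013-05-01 00:01:00"><value>259</value></Time>
  </channel>
</data>
"""

root = ET.fromstring(snippet)

channels, times, values = [], [], []
for channel in root.iter('channel'):
    # Walk the Time children of this channel so the parent stays in scope.
    for time_el in channel.findall('Time'):
        channels.append(channel.get('channel'))   # parent attribute
        times.append(time_el.get('Time'))         # child attribute
        values.append(time_el.findtext('value'))  # grandchild text

# One list per column, ready to pass to the DataFrame constructor.
columns = {'channel': channels, 'Time': times, 'value': values}
```

Looping over channels first, rather than selecting every `Time` element directly, keeps the parent element available without a separate lookup for each row.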
