将xml解析为pandas数据框会引发内存错误 [英] Parsing xml to pandas data frame throws memory error

查看：49 发布时间：2020/5/23 23:50:12 python xml pandas

本文介绍了将xml解析为pandas数据框会引发内存错误的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在尝试将1.父属性2.子属性和3.孙子文本放入数据框中.我能够将child属性和孙子文本打印在屏幕上，但是我无法让它们进入数据框.我从熊猫那里收到内存错误.

I am trying to put 1. a parent attribute 2. a child attribute and 3. a grandchild text into a data frame. I am able to get the child attribute and the grandchild text to print out on the screen, but I cannot get them to go into a data frame. I get a memory error from pandas.

这是介绍性的东西

import requests
from lxml import etree, objectify
r = requests.get('https://api.stuff.us/place/getData?   security_key=key&period=minutes&startTime=2013-05-01T00:00&endTime=2013-05-01T23:59&sort=channel') #edited for privacy
root = etree.fromstring(r.text)
xml_new = etree.tostring(root, pretty_print=True)
print xml_new[300:900] #gives xml output to show structure
<startTime>2013-05-01 00:00:00</startTime>
<endTime>2013-05-01 23:59:00</endTime>
<summaryPeriod>minutes</summaryPeriod>
<data>
  <channel channel="97925" name="blah"> 
    <Time Time="2013-05-01 00:00:00">
      <value>258</value>
    </Time>
    <Time Time="2013-05-01 00:01:00">
      <value>259</value>
    </Time>
    <Time Time="2013-05-01 00:02:00">
      <value>258</value>
    </Time>
    <Time Time="2013-05-01 00:03:00">
      <value>257</value>
    </Time>

这显示了我如何解析以获取child属性和孙子属性进行打印.

This shows how I am parsing to get the child attribute and grandchild to print.

for df in root.xpath('//channel/Time'):
    ## Iterate over attributes of channel/Time
    for attrib in df.attrib:
            print '@' + attrib + '=' + df.attrib[attrib]
    ## value is a child of time, and iterate
    subfields = df.getchildren()
    for subfield in subfields:
            print 'subfield=' + subfield.text

它会按照要求打印出很长的信息:

It yields a very long print out with the information as requested:

...
@Time=2013-05-01 23:01:00
value=100
@Time=2013-05-01 23:02:00
value=101
@Time=2013-05-01 23:03:00
value=99
@Time=2013-05-01 23:04:00
value=101
...

但是，当我尝试将其放入数据帧时，出现内存错误.我尝试了这两个方法，也只是尝试将child属性添加到数据框中.

However, when I try to put it into a data frame, I get a memory error. I tried with both of them an also with just trying to get the child attribute into a data frame.

data = []
for df in root.xpath('//channel/Time'):
    ## Iterate over attributes of channel/Time
    for attrib in df.attrib:
        el_data = {}
        el_data[attrib] = df.attrib[attrib]
    data.append(el_data)
from pandas import *
perf = DataFrame(data)
perf

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-6-08c8c74f7192> in <module>()
      1 from pandas import *
----> 2 perf = DataFrame(data)
      3 perf

/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-    packages/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
    417 
    418                 if isinstance(data[0], (list, tuple, collections.Mapping, Series)):
--> 419                     arrays, columns = _to_arrays(data, columns, dtype=dtype)
    420                     columns = _ensure_index(columns)
    421 

/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/core/frame.pyc in _to_arrays(data, columns, coerce_float, dtype)
   5457         return _list_of_dict_to_arrays(data, columns,
   5458                                        coerce_float=coerce_float,
-> 5459                                        dtype=dtype)
   5460     elif isinstance(data[0], Series):
   5461         return _list_of_series_to_arrays(data, columns,

/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-    packages/pandas/core/frame.pyc in _list_of_dict_to_arrays(data, columns, coerce_float, dtype)
   5521             for d in data]
   5522 
-> 5523     content = list(lib.dicts_to_array(data, list(columns)).T)
   5524     return _convert_object_array(content, columns, dtype=dtype,
   5525                                  coerce_float=coerce_float)

/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.dicts_to_array (pandas/lib.c:7657)()

MemoryError:

我的xml文件中有值"的12960个值.我认为这些内存错误告诉我有关文件中值的信息不符合预期的情况，但这与内存错误不符，并且我无法从其他有关内存错误的问题或熊猫文档.

I have 12960 values of "value" in my xml file. I assume that these memory errors are telling me something about the values in the file not meeting what is expected, but that doesn't match with a memory error, and I could not figure it out from other SO questions regarding memory errors or from the pandas documentation.

尝试获取数据类型不会产生任何信息.也许没有类型?也许是因为它们是元素树中的元素. (我尝试打印.pyval，但它只告诉我没有属性.)el_data的类型为"dict"

An attempt to get the data types yields no information. Maybe there are no types? Perhaps because they are elements in an element tree. (I tried to print .pyval, but it only told me there was no attribute.) el_data is of type "dict"

print(objectify.dump(root))[700:1000] #print a subset of types
name = 'zone'
            Time = None [_Element]
              * Time = '2013-05-01 00:00:00'
                value = '258' [_Element]
            Time = None [_Element]
              * Time = '2013-05-01 00:01:00'
                value = '259' [_Element]
type(el_data)
dict

我基于《 Python for Data Analysis》一书以及在SO上找到的其他用于解析XML的示例构建了此代码.我还是python的新手.

I built this code based on the book Python for Data Analysis and other examples found on SO for parsing XML. I am still new to python.

在Mac OS 10.7.5上运行Python 2.7.2

Running Python 2.7.2 on Mac OS 10.7.5

将xml解析为pandas数据框会引发内存错误 [英] Parsing xml to pandas data frame throws memory error

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

将xml解析为pandas数据框会引发内存错误 [英] Parsing xml to pandas data frame throws memory error

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭