RAM crashed for XML to DataFrame conversion function


Problem Description

I have created the following function, which converts an XML file to a DataFrame. This function works well for files smaller than 1 GB; for anything greater than that, the RAM (13 GB on Google Colab) crashes. The same happens if I try it locally in a Jupyter Notebook (4 GB laptop RAM). Is there a way to optimize the code?

Code

#Libraries
import pandas as pd
import xml.etree.ElementTree as ET  # cElementTree is deprecated (removed in Python 3.9)

#Function to convert XML file to Pandas Dataframe
def xml2df(file_path):

  #Parsing XML File and obtaining root
  tree = ET.parse(file_path)
  root = tree.getroot()

  dict_list = []

  for _, elem in ET.iterparse(file_path, events=("end",)):
    if elem.tag == "row":
      dict_list.append(elem.attrib)      # PARSE ALL ATTRIBUTES
      elem.clear()

  df = pd.DataFrame(dict_list)
  return df

Part of the XML file ('Badges.xml')

<badges>
  <row Id="82946" UserId="3718" Name="Teacher" Date="2008-09-15T08:55:03.923" Class="3" TagBased="False" />
  <row Id="82947" UserId="994" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82949" UserId="3893" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82950" UserId="4591" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82951" UserId="5196" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82952" UserId="2635" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82953" UserId="1113" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />

I also tried the SAX code below, but I get the same RAM-exhausted error.

import pandas as pd
import xml.sax

class BadgeHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.row = None
        self.row_data = []
        self.df = None

    # Call when an element starts
    def startElement(self, tag, attributes):
        if tag == 'row':
            self.row = dict(attributes)  # copy via the public mapping API instead of the private _attrs

    # Call when an element ends
    def endElement(self, tag):
        if self.row and tag == 'row':
            self.row_data.append(self.row)

    def endDocument(self):
        self.df = pd.DataFrame(self.row_data)

LOAD_FROM_FILE = True

handler = BadgeHandler()
if LOAD_FROM_FILE:
    print('loading from file')
    # 'rows.xml' is a file that contains your XML example
    xml.sax.parse('/content/Badges.xml', handler)
else:
    print('loading from string')
    # xml_str would hold the XML document as a str/bytes object
    xml.sax.parseString(xml_str, handler)
print(handler.df)


Recommended Answer

I decided to dig deeper into this.

It turns out Pandas is very memory-inefficient when creating dataframes from a list of dicts, for who knows what reason.

You can find my full experiment code (which generates a gigabyte of XML and reads it back) on GitHub, but the gist of it is that (on my Python 3.8, macOS):


reading the XML document into a dataframe, with code adapted from @balderman's answer (read_xml_to_pd.py):

  • takes 6,838,556k (~7 GB) to 10,508,892k (~10 GB) of memory (who knows why it varies) and about 52 seconds to read the data into memory
  • takes 12,128,400k (12.1 GB) of memory to hold the data plus the dataframe
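For reference, the streaming read this measurement refers to presumably looks something like the following minimal sketch (a reconstruction based on the standard iterparse recipe; @balderman's actual code in read_xml_to_pd.py may differ):

```python
# A reconstruction of the streaming read (assuming the standard
# iterparse recipe; not the exact read_xml_to_pd.py from the answer).
import xml.etree.ElementTree as ET
import pandas as pd

def read_xml_to_pd(file_path):
    rows = []
    context = ET.iterparse(file_path, events=("start", "end"))
    _, root = next(context)                  # first event is the root's start
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            rows.append(dict(elem.attrib))   # copy attributes before clearing
            root.clear()                     # drop finished <row> children
    return pd.DataFrame(rows)
```

Clearing the root after each row keeps only one <row> element alive at a time, so the memory cost is dominated by the accumulated list of dicts and the final DataFrame, which matches the multi-gigabyte figures measured above.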

reading the XML document to a CSV file (with SAX):



  • takes 16-17 megabytes of memory and some 1.5 minutes to write a 400-megabyte badges.csv (python read_xml_to_csv.py)
  • takes up to 2,989,080k (2.9 GB) memory and about 10 seconds to read the CSV using pd.read_csv() (read_csv_to_pd.py)
  • finally 2,033,208k (2.0 GB) memory is required to just hold the dataframe
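The XML-to-CSV step can be sketched roughly as follows (a simplified illustration of the idea, not the answerer's actual read_xml_to_csv.py; the function name and the fieldnames parameter are mine):

```python
# A sketch of the XML -> CSV streaming idea: write each <row>'s
# attributes straight to disk as they are parsed, so peak memory
# stays flat regardless of the XML file's size.
import csv
import xml.sax

class Xml2CsvHandler(xml.sax.ContentHandler):
    def __init__(self, writer, fieldnames):
        super().__init__()
        self.writer = writer
        self.fieldnames = fieldnames

    def startElement(self, tag, attributes):
        if tag == "row":
            # Only the current row is held in memory at any time
            self.writer.writerow({k: attributes.get(k, "") for k in self.fieldnames})

def xml_to_csv(xml_path, csv_path, fieldnames):
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        xml.sax.parse(xml_path, Xml2CsvHandler(writer, fieldnames))
```

Each row goes straight from the SAX callback to disk, which is what lets the two-step XML → CSV → pd.read_csv() route run in tens of megabytes instead of gigabytes.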

A binary intermediate format would probably be faster and more efficient still.
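As one concrete illustration (my assumption; the answer does not benchmark a specific binary format), pandas can round-trip the intermediate table through its binary pickle format, skipping CSV text parsing on reload:

```python
# One possible binary intermediate (an illustration; the answer does
# not name a specific format): pandas' pickle round-trip avoids CSV
# text parsing entirely on reload.
import pandas as pd

df = pd.DataFrame({"Id": ["82946", "82947"], "Name": ["Teacher", "Teacher"]})
df.to_pickle("badges.pkl")           # binary write
df2 = pd.read_pickle("badges.pkl")   # binary read-back
```

Feather or Parquet (df.to_parquet(), given pyarrow) would be similar choices, trading read_csv()'s text-parsing cost for a direct binary load.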

