RAM crashes in an XML-to-DataFrame conversion function
Question
I have created the following function, which converts an XML file to a DataFrame. It works well for files smaller than 1 GB; for anything larger, the RAM crashes (13 GB of Google Colab RAM). The same happens if I try it locally in a Jupyter Notebook (4 GB of laptop RAM). Is there a way to optimize the code?
Code
# Libraries
import pandas as pd
import xml.etree.cElementTree as ET

# Function to convert an XML file to a pandas DataFrame
def xml2df(file_path):
    # Parse the XML file and obtain the root
    tree = ET.parse(file_path)
    root = tree.getroot()
    dict_list = []
    for _, elem in ET.iterparse(file_path, events=("end",)):
        if elem.tag == "row":
            dict_list.append(elem.attrib)  # parse all attributes
        elem.clear()
    df = pd.DataFrame(dict_list)
    return df
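One detail worth noting about the function above: it reads the file twice, since ET.parse() builds the entire tree in memory before ET.iterparse() streams it again, and the parsed tree/root are never used. A single-pass sketch of the same conversion (an assumption about the intended behavior, not the asker's code; note the attribute dict is copied because the pure-Python Element.clear() empties elem.attrib in place) could look like:

```python
import pandas as pd
import xml.etree.ElementTree as ET

def xml2df_stream(file_path):
    # Stream the file once; only the per-row attribute dicts accumulate.
    dict_list = []
    for _, elem in ET.iterparse(file_path, events=("end",)):
        if elem.tag == "row":
            # Copy, because elem.clear() may empty the attrib dict in place.
            dict_list.append(dict(elem.attrib))
        elem.clear()  # free text/children of elements as we go
    return pd.DataFrame(dict_list)
```

This avoids holding the full tree, although the list of dicts (and the resulting DataFrame) still has to fit in memory.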
Part of the XML file ('Badges.xml')
<badges>
<row Id="82946" UserId="3718" Name="Teacher" Date="2008-09-15T08:55:03.923" Class="3" TagBased="False" />
<row Id="82947" UserId="994" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82949" UserId="3893" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82950" UserId="4591" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82951" UserId="5196" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82952" UserId="2635" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82953" UserId="1113" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
I also tried the SAX code below, but I got the same RAM-exhausted error.

import xml.sax
import pandas as pd
class BadgeHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.row = None
        self.row_data = []
        self.df = None

    # Called when an element starts
    def startElement(self, tag, attributes):
        if tag == 'row':
            self.row = dict(attributes)

    # Called when an element ends
    def endElement(self, tag):
        if self.row and tag == 'row':
            self.row_data.append(self.row)

    def endDocument(self):
        self.df = pd.DataFrame(self.row_data)

LOAD_FROM_FILE = True
handler = BadgeHandler()
if LOAD_FROM_FILE:
    print('loading from file')
    # 'Badges.xml' is a file that contains your XML example
    xml.sax.parse('/content/Badges.xml', handler)
else:
    print('loading from string')
    xml.sax.parseString(xml_str, handler)
print(handler.df)
Answer
I decided to dig deeper into this.
It turns out that pandas is very memory-inefficient when creating DataFrames from a list of dicts, for whatever reason.
You can find my full experiment code (which generates a gigabyte of XML and reads it back) on GitHub, but the gist of it is (on my Python 3.8, macOS):

Reading the XML document into a dataframe, with code adapted from @balderman's answer (read_xml_to_pd.py):
- takes 6,838,556k (~7 GB) to 10,508,892k (~10 GB) of memory (who knows why it varies) and about 52 seconds to read the data into memory
- takes 12,128,400k (12.1 GB) of memory to hold both the data and the dataframe
Reading the XML document into a CSV file (with SAX):
- takes 16-17 megabytes of memory and some 1.5 minutes to write a 400-megabyte badges.csv (read_xml_to_csv.py)
- takes up to 2,989,080k (2.9 GB) of memory and about 10 seconds to read the CSV using pd.read_csv() (read_csv_to_pd.py)
- finally, 2,033,208k (2.0 GB) of memory is required to just hold the dataframe
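The actual read_xml_to_csv.py lives in the GitHub repository mentioned above; as a rough sketch of the approach (the field list here is assumed from the Badges.xml sample, not taken from the answer's code), the SAX handler writes each row straight to the CSV so that nothing accumulates in memory:

```python
import csv
import xml.sax

# Assumed from the Badges.xml sample shown in the question.
FIELDS = ["Id", "UserId", "Name", "Date", "Class", "TagBased"]

class RowToCsvHandler(xml.sax.ContentHandler):
    def __init__(self, writer):
        self.writer = writer

    def startElement(self, tag, attributes):
        if tag == "row":
            # Write immediately; only one row is ever held in memory.
            self.writer.writerow([attributes.get(f, "") for f in FIELDS])

def xml_to_csv(xml_path, csv_path):
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(FIELDS)
        xml.sax.parse(xml_path, RowToCsvHandler(writer))
```

Because the handler never stores rows, memory use stays flat regardless of the XML file's size.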
A binary intermediate format would probably be faster and more efficient still.
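To illustrate that last point with a minimal sketch: Parquet (df.to_parquet(), which needs pyarrow or fastparquet installed) would be the usual portable choice for such an intermediate file; pandas' built-in pickle format stands in here because it has no extra dependency.

```python
import pandas as pd

def roundtrip_binary(df, path):
    # Pickle is pandas' zero-dependency binary format; unlike CSV it
    # preserves dtypes exactly and skips text parsing on the way back.
    df.to_pickle(path)
    return pd.read_pickle(path)
```

Either way, the win over CSV comes from skipping the text parsing and type inference that pd.read_csv() has to do.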