Memory error while parsing/loading a very large XML file

Views: 77 · Published: 2021/6/13 19:30:25 · Tags: python, xml, parsing, out-of-memory


Problem description

I have a large database in one XML file and I need to process the data in it (using Python).

I tried to parse it with the xml library, using xml.dom.minidom and (in another script) xml.etree.ElementTree, and then to descend tag by tag down to the <s> tags and iterate over the tags I need (<t>) to retrieve the relevant data.

My problem is that the file is really large (217 MB) and I cannot parse or load it. I keep getting a memory error and it is not even loaded.

The structure of the file is this:

<corpus>
  <head> ... </head>
  <body>
    <s id="s1">
      <graph>
        <terminals>
          <t id="s1_1" ex="bla" ex2="bla2"/>
          <t id="s1_2" ex="bla" ex2="bla2"/>
          <t id="s1_3" ex="bla" ex2="bla2"/>
        </terminals>
      </graph>
    </s>
    <s id="s2">
      <graph>
        <terminals>
          <t id="s2_1" ex="bla" ex2="bla2"/>
          <t id="s2_2" ex="bla" ex2="bla2"/>
          <t id="s12_3" ex="bla" ex2="bla2"/>
        </terminals>
      </graph>
    </s>
    .... # more than 50K <s> tags and almost 1M <t> tags
  </body>
</corpus>

What I really need is to retrieve all the <t/> tags and to store the data of their attributes in a csv or something, but the computer cannot parse the large file.

I would be very happy to read your advice.

Thank you very much!

Solution

Try this XML library:

pip install simplified_scrapy

from simplified_scrapy import SimplifiedDoc, utils

doc = SimplifiedDoc()
doc.loadFile('test.xml', lineByline=True)  # Read data line by line
for s in doc.getIterable('s'):
    print(s.selects('t'))
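
If installing a third-party package is not an option, the same goal can probably be reached with the standard library alone. The sketch below is not the approach from the answer above: it swaps in xml.etree.ElementTree.iterparse, which reads the file incrementally instead of building the whole tree, and writes each <t> element's attributes to CSV. The function name export_t_attributes and the column list (id, ex, ex2) are assumptions taken from the sample structure in the question, so adjust them to the real attribute names.

```python
import csv
import io
import xml.etree.ElementTree as ET


def export_t_attributes(xml_source, csv_target, fieldnames=("id", "ex", "ex2")):
    """Stream <t> elements from xml_source and write their attributes as CSV rows.

    xml_source: a file name or a binary file object containing the XML.
    csv_target: a writable text file object (open real files with newline="").
    fieldnames: assumed attribute names, based on the sample in the question.
    Returns the number of <t> elements written.
    """
    writer = csv.DictWriter(csv_target, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    count = 0
    # "end" events fire once an element has been read completely.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "t":
            writer.writerow(elem.attrib)
            count += 1
        elif elem.tag == "s":
            # All <t> children of this <s> were handled already,
            # so drop its subtree to keep memory use low.
            elem.clear()
    return count


# Small in-memory demo that mirrors the structure from the question.
sample = b"""<corpus><head/><body>
<s id="s1"><graph><terminals>
<t id="s1_1" ex="bla" ex2="bla2"/>
<t id="s1_2" ex="bla" ex2="bla2"/>
</terminals></graph></s>
<s id="s2"><graph><terminals>
<t id="s2_1" ex="bla" ex2="bla2"/>
</terminals></graph></s>
</body></corpus>"""

out = io.StringIO()
written = export_t_attributes(io.BytesIO(sample), out)
print(written)
print(out.getvalue())
```

For a real file, something like `with open('test.xml', 'rb') as src, open('t_tags.csv', 'w', newline='') as dst: export_t_attributes(src, dst)` should work (t_tags.csv is a made-up output name). Clearing each <s> after its <t> children are written is what keeps memory roughly flat; the emptied <s> elements stay attached to <body>, which is small compared with the full tree.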
