XML parsing in Python for big data
Question
I am trying to parse an XML file using Python. The problem is that the file is around 30GB, so the following call takes hours to execute:
import xml.etree.ElementTree as ET
tree = ET.parse('Posts.xml')  # builds the whole tree in memory before returning
In my XML file, the root has millions of child elements. Is there any way to make this faster? I don't need to parse all of the children; even the first 100,000 would be fine. All I need is a way to limit how much of the file gets parsed.
Recommended answer
You'll want an XML parsing mechanism that doesn't load everything into memory.
You can use ElementTree.iterparse, or you can use SAX.
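As a rough illustration of the SAX option, the sketch below collects the attributes of the first few elements and then abandons the parse by raising an exception from the handler. The element name "row", the attribute "Id", and the names StopParsing, RowHandler and first_rows_sax are assumptions made for this example; the question does not show the actual structure of Posts.xml.

```python
import io
import xml.sax


class StopParsing(Exception):
    """Raised from the handler to abandon the parse early (hypothetical helper)."""


class RowHandler(xml.sax.ContentHandler):
    """Collects attributes of the first `limit` elements named "row" (assumed tag)."""

    def __init__(self, limit):
        super().__init__()
        self.limit = limit
        self.rows = []

    def startElement(self, name, attrs):
        if name != "row":
            return
        # Copy the attributes into a plain dict; SAX keeps no tree in memory.
        self.rows.append(dict(attrs.items()))
        if len(self.rows) >= self.limit:
            raise StopParsing


def first_rows_sax(source, limit):
    """Parse `source` (a filename or binary stream) until `limit` rows are seen."""
    handler = RowHandler(limit)
    try:
        xml.sax.parse(source, handler)
    except StopParsing:
        pass  # enough rows collected; the rest of the file is never read
    return handler.rows


# Tiny in-memory stand-in for the real 30GB file.
sample = io.BytesIO(b'<posts><row Id="1"/><row Id="2"/><row Id="3"/></posts>')
print(first_rows_sax(sample, 2))
```

For the real file, the same function could be called as first_rows_sax('Posts.xml', 100000), assuming the children of interest really are named "row".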
Here is a page with some XML processing tutorials for Python.
UPDATE: As @marbu said in the comments, if you use ElementTree.iterparse, be sure to use it in such a way that you get rid of elements in memory once you've finished processing them.
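One common way to do this, shown here only as a minimal sketch, is to grab the root element from the first event and call root.clear() each time a direct child has been fully parsed, stopping once enough children have been seen. The function name first_children, the sample tag "row" and the attribute "Id" are assumptions for illustration and are not taken from the question.

```python
import io
import xml.etree.ElementTree as ET


def first_children(source, limit):
    """Yield (tag, attributes) for the first `limit` direct children of the root.

    `source` may be a filename or an open binary file object; passing a file
    object lets the caller decide when it is closed after an early stop.
    """
    if limit <= 0:
        return
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the "start" of the root
    depth = 0  # nesting level below the root
    count = 0
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth != 0:
            continue  # a nested element (or the root itself) closed
        # A direct child of the root is complete: copy what is needed,
        # then drop it so the in-memory tree never grows.
        item = (elem.tag, dict(elem.attrib))
        root.clear()
        count += 1
        yield item
        if count >= limit:
            return  # stop reading; the rest of the file is not parsed


# Tiny in-memory stand-in for the real 30GB file.
sample = io.BytesIO(b'<posts><row Id="1"><x/></row><row Id="2"/><row Id="3"/></posts>')
print(list(first_children(sample, 2)))
```

Clearing the root rather than only the finished child matters here: elem.clear() alone would leave millions of empty child elements attached to the root.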