How to split an XML file the simple way in Python?

Question

I have Python code for parsing an XML file, as detailed here. I understand that XML files are notorious for hogging system resources when manipulated in memory. My solution works for smaller XML files (around 200 KB), but I have a 340 MB file.

I started researching a StAX (pull parser) implementation, but I am on a tight schedule and am looking for a much simpler approach to this task.

I understand how to create smaller chunks of the file, but how do I extract the right elements while writing out the main/header tags every time?

For example, this is the schema:

<?xml version="1.0" ?>
<!--Sample XML Document-->
<bookstore>
    <book Id="1">
      ....
      ....
    </book> 
    <book Id="2">
      ....
      ....
    </book> 
    <book Id="3">
      ....
      ....
    </book> 
    ....
    ....
    ....
    <book Id="n">
      ....
      ....
    </book> 
</bookstore>

How do I create new XML files with header data for every 1000 book elements? For a concrete example of the code and data set, please refer to my other question here. Thanks a lot.

All I want to do is avoid loading the whole dataset into memory at once. Can we parse the XML file in a streaming fashion? Am I thinking along the right lines?

P.S.: My situation is similar to a question asked in 2009. I will post an answer here once I find a simpler solution to my problem. Your feedback is appreciated.

Recommended answer

You can parse your big XML file incrementally:

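# xml.etree.cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
from xml.etree.ElementTree import iterparse

# get an iterable and turn it into an iterator
context = iter(iterparse("path/to/big.xml", events=("start", "end")))

# the first event is the "start" of the root element
event, root = next(context)
assert event == "start"

for event, elem in context:
    if event == "end" and elem.tag == "book":
        # ... process book elements ...
        root.clear()  # drop already-parsed children so memory use stays bounded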
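
The loop above only streams through the file. One way to extend it to the actual split is sketched below, as a minimal example and not a definitive implementation. The function name `split_xml`, its parameters, and the `part_0001.xml` naming scheme are hypothetical and chosen for illustration. The sketch assumes that the `book` elements are direct children of the root and are never nested inside each other, and that repeating the root tag with its attributes is sufficient "header data" for each output file.

```python
import os
import xml.etree.ElementTree as ET


def split_xml(src_path, out_dir, item_tag="book", chunk_size=1000):
    """Write every `chunk_size` <item_tag> elements to a separate file.

    Each output file wraps its elements in a copy of the source root tag.
    Returns the list of file paths written. (Hypothetical helper.)
    """
    os.makedirs(out_dir, exist_ok=True)

    context = iter(ET.iterparse(src_path, events=("start", "end")))
    _, root = next(context)  # first event is the "start" of the root element

    # Save tag and attributes now: root.clear() below also wipes root.attrib.
    root_tag, root_attrib = root.tag, dict(root.attrib)

    written = []  # paths of the files produced so far
    batch = []    # elements collected for the current chunk

    def flush():
        if not batch:
            return
        wrapper = ET.Element(root_tag, dict(root_attrib))
        wrapper.extend(batch)
        path = os.path.join(out_dir, f"part_{len(written) + 1:04d}.xml")
        ET.ElementTree(wrapper).write(path, encoding="utf-8", xml_declaration=True)
        written.append(path)
        batch.clear()
        root.clear()  # release the elements that were just written

    for event, elem in context:
        if event == "end" and elem.tag == item_tag:
            batch.append(elem)
            if len(batch) == chunk_size:
                flush()

    flush()  # write the final, possibly shorter, chunk
    return written
```

Calling `split_xml("path/to/big.xml", "chunks")` would then produce `chunks/part_0001.xml`, `chunks/part_0002.xml`, and so on. At most `chunk_size` parsed elements are held in memory at a time, because `root.clear()` is called after each chunk is written.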
