Concurrent SAX processing of large, simple XML files?


Problem description

I have a couple of gigantic XML files (10GB-40GB) that have a very simple structure: just a single root node containing multiple row nodes. I'm trying to parse them using SAX in Python, but the extra processing I have to do for each row means that the 40GB file takes an entire day to complete. To speed things up, I'd like to use all my cores simultaneously. Unfortunately, it seems that the SAX parser can't deal with "malformed" chunks of XML, which is what you get when you seek to an arbitrary line in the file and try parsing from there. Since the SAX parser can accept a stream, I think I need to divide my XML file into eight different streams, each containing [number of rows]/8 rows and padded with fake opening and closing tags. How would I go about doing this? Or — is there a better solution that I might be missing? Thank you!
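
For concreteness, here is a minimal sketch of the splitting idea the question describes. It assumes, hypothetically, that each record is a <row> element on its own line under a single <root> element and that the file is named data.xml; a production version would record byte offsets and stream each range rather than load every row into memory.

import multiprocessing
import xml.sax


class RowCounter(xml.sax.ContentHandler):
    """Stand-in handler: counts rows instead of doing real work."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def startElement(self, name, attrs):
        if name == "row":
            self.rows += 1


def parse_chunk(lines):
    # Pad the chunk with fake root tags so the parser sees well-formed XML.
    chunk = "<root>" + "".join(lines) + "</root>"
    handler = RowCounter()
    xml.sax.parseString(chunk.encode("utf-8"), handler)
    return handler.rows


def split_rows(path, n_chunks):
    # Naive split: materializes the row lines in memory. For 10GB-40GB
    # files you would instead seek to byte offsets and stream each range.
    with open(path, encoding="utf-8") as f:
        rows = [line for line in f if "<row" in line]
    size = (len(rows) + n_chunks - 1) // n_chunks
    return [rows[i:i + size] for i in range(0, len(rows), size)]


if __name__ == "__main__":
    chunks = split_rows("data.xml", 8)  # hypothetical file name
    with multiprocessing.Pool(8) as pool:
        print("rows per chunk:", pool.map(parse_chunk, chunks))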

Recommended answer

You can't easily split the SAX parsing into multiple threads, and you don't need to: if you just run the parse without any other processing, it should run in 20 minutes or so. Focus on the processing you do to the data in your ContentHandler.
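
One hedged sketch of that advice, under the same assumptions as above (rows are <row> elements, the file name data.xml is hypothetical): keep a single SAX pass, but hand each completed row to a multiprocessing pool so the expensive per-row work runs on all cores while parsing continues.

import multiprocessing
import xml.sax


def process_row(text):
    # Stand-in for the expensive per-row work; replace with your own logic.
    return len(text)


class RowHandler(xml.sax.ContentHandler):
    """Parses on one core, farms each finished row out to worker processes."""

    def __init__(self, pool):
        super().__init__()
        self.pool = pool
        self.buffer = []
        self.pending = []

    def startElement(self, name, attrs):
        if name == "row":
            self.buffer = []

    def characters(self, content):
        self.buffer.append(content)

    def endElement(self, name):
        if name == "row":
            # Submit asynchronously so parsing never waits on processing.
            # For very long runs, drain self.pending periodically instead of
            # letting it grow without bound.
            text = "".join(self.buffer)
            self.pending.append(self.pool.apply_async(process_row, (text,)))


if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        handler = RowHandler(pool)
        xml.sax.parse("data.xml", handler)  # hypothetical file name
        results = [r.get() for r in handler.pending]
    print("processed", len(results), "rows")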
