Parse over a million XML files


Problem Description


I am looking for the best way to parse over a million XML files ranging in size from 2KB to 10MB. In total, the files add up to somewhere in the neighborhood of 500GB. The application collects data from various nodes throughout each file and shoves them into a Postgres database schema. I wrote Python code using etree that did this a long time ago, when the number of XML files was much smaller, but now it takes close to a week to process. Any ideas on the best strategy to scale this? If I could process these in a day or two it would be a huge improvement.

What I have tried:

import os
import threading
import Queue                      # Python 2; on Python 3 use "import queue as Queue"
from xml.dom import minidom

class ParseJob(threading.Thread):
    # Worker thread: pull paths off the shared queue and parse each file.
    def __init__(self, path_queue, stopper):
        super(ParseJob, self).__init__()
        self.path_queue = path_queue
        self.stopper = stopper

    def run(self):
        while not self.stopper.is_set():
            try:
                path = self.path_queue.get_nowait()
            except Queue.Empty:
                break
            with open(path, 'rb') as f:
                xmldoc = minidom.parse(f)
            parseFunc(xmldoc)
            self.path_queue.task_done()

def parseFunc(xmldoc):
    ##does all the parsing
    pass

def main():
    path_queue = Queue.Queue()
    dir = ##path to xml files
    for name in os.listdir(dir):
        path_queue.put(os.path.join(dir, name))
    stopper = threading.Event()
    num_workers = 8
    threads = list()
    for i in range(num_workers):
        job = ParseJob(path_queue, stopper)
        threads.append(job)
        job.start()
    path_queue.join()

main()

Solution

I would probably have a component that manages a queue, loading it with the file paths to process. This component hands out file paths to worker threads, each of which does all the work for a single file. You can have as many of these workers running as there are cores available in the CPU. When a worker is done with a file, it goes back to the queue and gets another one.

As the files are done being processed, it'd probably be a good idea to delete them or move them to some archive storage if the data is important enough. The thing you have to keep in mind when designing this is what happens if the system crashes or something during processing. How is the algorithm going to recover from an interrupted job?

It'll still take a long time to get through all the files, but you'll definitely cut the job time down to something closer to a day, depending on how many cores you can keep busy.
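Below is a minimal Python 3 sketch of that layout. It assumes hypothetical incoming/ and archive/ directories and a placeholder parse_and_load() standing in for the real parse-and-insert logic; since the parsing itself is CPU-bound, it uses worker processes rather than threads so every core actually gets used.

import os
import shutil
from concurrent.futures import ProcessPoolExecutor
from xml.etree import ElementTree as ET

INCOMING = "incoming"    # assumption: directory holding the unprocessed XML files
ARCHIVE = "archive"      # assumption: directory that finished files are moved into

def parse_and_load(path):
    # Placeholder for the real work: collect the nodes of interest and
    # insert them into the Postgres schema.
    tree = ET.parse(path)
    return path

def handle_one(path):
    parse_and_load(path)
    # Moving the finished file out of the way doubles as a checkpoint:
    # after a crash, a restart only sees the files that are still pending.
    shutil.move(path, os.path.join(ARCHIVE, os.path.basename(path)))
    return path

def main():
    paths = [os.path.join(INCOMING, name) for name in os.listdir(INCOMING)]
    # One worker process per core; each worker pulls the next file as soon
    # as it finishes the previous one.
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        for _ in pool.map(handle_one, paths, chunksize=16):
            pass    # progress logging could go here

if __name__ == "__main__":
    main()

Using processes instead of threads matters here because CPU-bound XML parsing in CPython is serialized by the GIL; a thread pool like the one in the question mostly keeps a single core busy.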


If cutting the time short were that simple, why do you think there is an entire subject called "Data Structures and Algorithms"? :-) Data is expected to grow, but as the database or system administrator you are required to make sure the logic keeps behaving the way it was expected to, not that it keeps running for weeks.

There are many ways to cut that time short. I will give you all of those points in a list, but I hope you will actually try to follow them, because there is no "other way"! You are expected to follow these rules to improve the time required.

1. Change the language! Python is not fast at all. Did you know that Python is an interpreted language? That makes it much slower.
- Use something like C++. It supports the same paradigms, and I am sure a prebuilt XML-parsing library is already available on CodeProject, GitHub or anywhere else.
2. Change how you arrange the data.
- How the data is laid out matters a great deal. Small files mixed with large files also cause a problem: after each file, the program has to clean up RAM and load the next one. Find an alternative to having the data split across so many small files (one way to keep memory flat on the larger files is sketched after this list).
3. Increase the CPU speed. You don't want to do the job of a supercomputer on a personal computer; that doesn't make any sense.
4. Think again!
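On the RAM concern in item 2: instead of building a whole DOM with minidom for every file, a streaming parse keeps memory flat no matter how big the file is. Here is a minimal sketch with the standard library's xml.etree.ElementTree.iterparse, assuming the data of interest lives in repeated <record> elements (a hypothetical tag name):

import xml.etree.ElementTree as ET

def stream_parse(path):
    rows = []
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "record":     # hypothetical element the real parseFunc would collect
            rows.append({child.tag: child.text for child in elem})
            elem.clear()             # drop the subtree we just consumed to keep memory flat
    return rows

minidom, by contrast, holds the entire document in memory before the parsing function even starts.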

The most important tool to use here is common sense. Your data spans over 500GB. Why? Also, when you want to query the data, why do you need to query all of it? These are a few things to consider while updating the data structures, updating your algorithms, and deciding whether to update the system hardware.

Otherwise, this time can be cut by at most about a day (which still leaves six days!), and nothing more.


When it comes to speed optimization, the tool of choice is the profiler.
The profiler is there to tell you how much time you spend in each part of your code and to detect the bottlenecks.
From your piece of code, one can guess that most of the time is spent in parseFunc, in the code you didn't show.
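For example, Python's built-in cProfile module will show exactly where the time goes. In the sketch below, 'sample.xml' is a placeholder file, and parseFunc and minidom are assumed to be defined or imported in the script being profiled.

import cProfile
import pstats

# Profile one representative file to find the hot spots inside parseFunc.
cProfile.run("parseFunc(minidom.parse('sample.xml'))", "parse.prof")

stats = pstats.Stats("parse.prof")
stats.sort_stats("cumulative").print_stats(20)   # top 20 calls by cumulative time

Running python -m cProfile -s cumtime on the script against a small batch of files gives the same breakdown without touching the code.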
Even if the choice of language has an effect on the efficiency of your code, the way you design the code has an even more dramatic effect.

In order to see whether we can improve the runtime, we need to know exactly what you do with each file and what requests you are answering.

