Optimize Python program to parse two large files at the same time


Question

I am trying to parse two large files with Python 3 at the same time, as shown here:

def parse_files(file1, file2):
    counts = {}  # maps each output string to how many times it was seen
    with open(file1, "r") as f1, open(file2, "r") as f2:
        for line_f1, line_f2 in zip(f1, f2):
            # parse the lines and save the line information in a dictionary
            row = {"ID_1": line_f1[0], "ID_2": line_f2[0], ...}

            # This process takes roughly 0.0005s each time.
            # It parses each pair of lines at once and returns an output;
            # it doesn't depend on previous lines or lines after.
            output = process(row)

            # output is a string; count it
            if output in counts:
                counts[output] += 1
            else:
                counts[output] = 1
    return counts
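
As a side note, the if/else counting at the end of the loop can be written more compactly with collections.Counter from the standard library. This is only a sketch of the same counting logic, reusing the process function and row layout from the code above:

from collections import Counter

counts = Counter()  # missing keys default to 0, so no if/else is needed
with open(file1, "r") as f1, open(file2, "r") as f2:
    for line_f1, line_f2 in zip(f1, f2):
        row = {"ID_1": line_f1[0], "ID_2": line_f2[0]}  # plus the remaining fields
        counts[process(row)] += 1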

When I tested the above code with two smaller text files (30,000 lines each, file size = 13 MB), it took roughly 150 seconds to finish the loop.

When I tested with two large text files (9,000,000 lines each, file size = 3.8 GB) without the process step in the loop, it took roughly 670 seconds.

When I tested with the same two large text files with the process step included, I timed it and every 10,000 items took roughly 60 seconds. The per-item time did not grow as the number of iterations got larger.

However, when I submit this job to a shared cluster, it takes more than 36 hours for one pair of large files to finish processing. I am trying to figure out whether there is any other way to process the files so it can be faster. Any suggestions would be appreciated.

Thanks in advance!

Answer

This is just a hypothesis, but your process could be wasting its allocated CPU slot every time it triggers an I/O to fetch a pair of lines. You could try reading groups of lines at a time and processing them in chunks, so you make the most of each CPU time slot you get on the shared cluster:

from collections import deque

chunkSize = 1000000  # number of characters in each chunk (you will need to adjust this)
chunk1 = deque([""])  # buffered lines from the 1st file
chunk2 = deque([""])  # buffered lines from the 2nd file
with open(file1, "r") as f1, open(file2, "r") as f2:
    while chunk1 and chunk2:
        line_f1 = chunk1.popleft()
        if not chunk1:
            # refill the buffer: complete the current line and queue the rest
            line_f1, *more = (line_f1 + f1.read(chunkSize)).split("\n")
            chunk1.extend(more)
        line_f2 = chunk2.popleft()
        if not chunk2:
            line_f2, *more = (line_f2 + f2.read(chunkSize)).split("\n")
            chunk2.extend(more)
        # process line_f1, line_f2
        ...

The way this works is by reading a chunk of characters (which must be larger than your longest line) and breaking it into lines. The lines are placed in a queue for processing.

Because the chunk size is expressed as a number of characters, the last line in the queue may be incomplete.

To ensure that lines are complete before being processed, another chunk is read when we get to the last line in the queue. The additional characters are appended to the end of the incomplete line, and the line splitting is performed on the combined string. Because we concatenate onto the last (incomplete) line, the .split("\n") call always operates on a piece of text that begins at a line boundary.
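
To make the mechanism concrete, here is a tiny illustration with made-up strings (the ids and field values are invented purely for this example):

chunk = "id1 a b\nid2 c d\nid3 e"   # made-up data; the 3rd line is cut off mid-way
line, *rest = chunk.split("\n")
print(line)        # id1 a b
print(rest)        # ['id2 c d', 'id3 e']   <- last element is the incomplete tail

tail = rest[-1]                      # 'id3 e'
next_chunk = " f\nid4 g h\n"         # the next read picks up exactly where the file left off
completed, *more = (tail + next_chunk).split("\n")
print(completed)   # id3 e f   <- the tail is now a complete line
print(more)        # ['id4 g h', '']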

The process then continues with the (now completed) last line, and the rest of the new lines are added to the queue.
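
Putting the two pieces together, one possible way to combine this chunked reader with the counting loop from the question is sketched below. Note that parse_pair is only a hypothetical stand-in for the question's process step, and chunk_size would still need tuning:

from collections import deque

def count_outputs(path1, path2, chunk_size=1_000_000):
    """Read both files in large chunks, pair their lines, and count the outputs."""
    counts = {}            # same output -> count mapping as in the question
    chunk1 = deque([""])   # buffered lines from the 1st file
    chunk2 = deque([""])   # buffered lines from the 2nd file
    with open(path1, "r") as f1, open(path2, "r") as f2:
        while chunk1 and chunk2:
            line_f1 = chunk1.popleft()
            if not chunk1:
                line_f1, *more = (line_f1 + f1.read(chunk_size)).split("\n")
                chunk1.extend(more)
            line_f2 = chunk2.popleft()
            if not chunk2:
                line_f2, *more = (line_f2 + f2.read(chunk_size)).split("\n")
                chunk2.extend(more)
            if line_f1 or line_f2:           # skip the trailing empty line at EOF
                output = parse_pair(line_f1, line_f2)
                counts[output] = counts.get(output, 0) + 1
    return counts

def parse_pair(line_f1, line_f2):
    # hypothetical stand-in for the question's process(row); returns a string key
    return line_f1[:1] + line_f2[:1]

Here chunk_size trades memory for the number of read calls: larger chunks mean each process gets more work done per I/O request, which is the point of the suggestion above.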

