Python, process a large text file in parallel


Problem description

Sample records in the data file (SAM file):

M01383  0  chr4  66439384  255  31M  *  0  0  AAGAGGA GFAFHGD  MD:Z:31 NM:i:0
M01382  0  chr1  241995435  255 31M  *  0  0  ATCCAAG AFHTTAG  MD:Z:31 NM:i:0
......

  • The data file is plain text, one record per line.
  • Each data file is between 1 GB and 5 GB in size.
  • I need to go through the records in the data file line by line, get a particular value (e.g. the 4th value, 66439384) from each line, and pass this value to another function for processing. Then some result counters will be updated.

The basic workflow is like this:

    # Global counters, updated in search_function according to the value passed.
    counter_a = 0
    counter_b = 0
    counter_c = 0

    def search_function(value):
        # some condition checks on `value` that update
        # counter_a, counter_b or counter_c (details omitted here)
        ...

    with open("data.sam") as textfile:     # example file name
        for line in textfile:
            value = line.split()[3]        # 4th field, e.g. 66439384
            search_function(value)         # this call takes a fairly long time

With single-process code and a data file of about 1.5 GB, it took about 20 hours to run through all the records in one data file. I need much faster code because there are more than 30 data files of this kind.

I was thinking of processing the data file in N chunks in parallel, with each chunk performing the above workflow and updating the global variables (counter_a, counter_b, counter_c) simultaneously. But I don't know how to achieve this in code, or whether it will work at all.
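
For reference, here is a minimal sketch of how such chunked processing could look with the standard multiprocessing module, assuming the per-line checks can be moved into a worker that returns partial counts. Global counters are not shared between processes, so the parent has to sum the workers' results; the file name, chunk size, and counter keys below are illustrative, not from the question:

    import multiprocessing
    from collections import Counter

    def process_chunk(lines):
        """Hypothetical worker: apply the per-line checks to one chunk of
        lines and return partial counts instead of updating globals."""
        local = Counter()
        for line in lines:
            value = line.split()[3]
            # ... the same condition checks as search_function(), e.g.:
            local["a"] += 1              # placeholder update
        return local

    def chunks(fh, size=100000):
        """Yield lists of `size` lines at a time from an open file."""
        chunk = []
        for line in fh:
            chunk.append(line)
            if len(chunk) == size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

    if __name__ == "__main__":
        totals = Counter()
        with open("data.sam") as fh, multiprocessing.Pool(24) as pool:
            for partial in pool.imap_unordered(process_chunk, chunks(fh)):
                totals += partial
        print(totals["a"], totals["b"], totals["c"])

Whether this helps depends on where the time actually goes: if search_function is CPU-bound it should scale with the number of cores, but if the 20 hours are dominated by disk I/O, adding processes will not change much.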

I have access to a server machine with 24 processors and around 40 GB of RAM.

Could anyone help with this? Thanks very much.

Recommended answer

The simplest approach would probably be to do all 30 files at once with your existing code -- it would still take all day, but you'd have all the files done at once. (I.e., "9 babies in 9 months" is easy, "1 baby in 1 month" is hard.)
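
As a rough sketch of launching the existing per-file workflow over all files with a process pool (the process_one_file wrapper and the *.sam glob pattern are assumptions, not part of the original answer):

    import glob
    import multiprocessing

    def process_one_file(path):
        # Assumed wrapper around the existing single-file workflow; it keeps
        # its counters local and returns them instead of updating globals.
        counter_a = counter_b = counter_c = 0
        with open(path) as fh:
            for line in fh:
                value = line.split()[3]
                # ... existing search_function logic, updating the local counters ...
        return path, counter_a, counter_b, counter_c

    if __name__ == "__main__":
        files = glob.glob("*.sam")                       # assumed naming pattern
        with multiprocessing.Pool(len(files)) as pool:   # one worker per file
            for path, a, b, c in pool.map(process_one_file, files):
                print(path, a, b, c)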

If you really want to get a single file done faster, it will depend on how your counters actually update. If almost all the work is just in analysing the value, you can offload that using the multiprocessing module:

    import time
    import multiprocessing
    
    def slowfunc(value):
        time.sleep(0.01)
        return value**2 + 0.3*value + 1
    
    counter_a = counter_b = counter_c = 0
    def add_to_counter(res):
        global counter_a, counter_b, counter_c
        counter_a += res
        counter_b -= (res - 10)**2
        counter_c += (int(res) % 2)
    
    pool = multiprocessing.Pool(50)
    results = []
    
    for value in range(100000):
        r = pool.apply_async(slowfunc, [value])
        results.append(r)
    
        # don't let the queue grow too long
        if len(results) == 1000:
            results[0].wait()
    
        while results and results[0].ready():
            r = results.pop(0)
            add_to_counter(r.get())
    
    for r in results:
        r.wait()
        add_to_counter(r.get())
    
    print(counter_a, counter_b, counter_c)
    

That will allow 50 slowfuncs to run in parallel, so instead of taking 1000 s (= 100k * 0.01 s), it takes about 20 s (= (100k / 50) * 0.01 s) to complete. If you can restructure your function into "slowfunc" and "add_to_counter" like the above, you should be able to get a factor-of-24 speedup.
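
As a variation on the same idea, Pool.imap_unordered with a chunksize does the queue management for you and can be fed straight from the file. A sketch reusing slowfunc and add_to_counter from the snippet above (the file name and the float conversion are assumptions about the real data):

    import multiprocessing

    def extract_value(line):
        # assumes the 4th field is numeric, as in the sample records
        return float(line.split()[3])

    if __name__ == "__main__":
        with open("data.sam") as fh, multiprocessing.Pool(24) as pool:
            values = (extract_value(line) for line in fh)
            # chunksize batches work items to reduce inter-process overhead
            for res in pool.imap_unordered(slowfunc, values, chunksize=1000):
                add_to_counter(res)
        print(counter_a, counter_b, counter_c)

Note that the factor of 24 is an upper bound: the parent process still reads the file and runs add_to_counter serially, so the actual speedup depends on how cheap those parts are relative to slowfunc.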

