Python, process a large text file in parallel
Question
Sample records in the data file (SAM file):
M01383 0 chr4 66439384 255 31M * 0 0 AAGAGGA GFAFHGD MD:Z:31 NM:i:0
M01382 0 chr1 241995435 255 31M * 0 0 ATCCAAG AFHTTAG MD:Z:31 NM:i:0
......
- The data file is processed line by line.
- Each data file is between 1 GB and 5 GB in size.
I need to go through the records in the data file line by line, extract a particular value from each line (e.g. the 4th field, 66439384), and pass it to another function for processing. Some result counters are then updated.
The basic workflow is:
# Global counters, updated in search_function according to the value passed.
counter_a = 0
counter_b = 0
counter_c = 0

def search_function(value):
    # this function takes a fairly long time to process
    global counter_a, counter_b, counter_c
    # some condition checking, then update counter_a, counter_b or counter_c
    ...

with open("data.sam") as textfile:
    for line in textfile:
        value = line.split()[3]
        search_function(value)
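As a concrete illustration (using the first sample record shown above), splitting the line on whitespace with str.split puts the position value at index 3:

```python
# First sample record from the SAM file above.
line = "M01383 0 chr4 66439384 255 31M * 0 0 AAGAGGA GFAFHGD MD:Z:31 NM:i:0"

# split() with no arguments splits on any run of whitespace.
fields = line.split()
value = fields[3]  # the 4th field, i.e. the position
print(value)  # -> 66439384
```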
With single-process code and a roughly 1.5 GB data file, it takes about 20 hours to run through all the records in one file. I need much faster code because there are more than 30 data files of this kind.
I was thinking of processing the data file in N chunks in parallel, where each chunk performs the above workflow and updates the global counters (counter_a, counter_b, counter_c) simultaneously. But I don't know how to achieve this in code, or whether it would work at all.
I have access to a server machine with 24 processors and around 40 GB of RAM.
Can anyone help with this? Thanks very much.
Answer
The simplest approach would probably be to process all 30 files at once with your existing code -- it would still take all day, but you'd have all the files done at the end. (i.e., "9 babies in 9 months" is easy, "1 baby in 1 month" is hard.)
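A minimal sketch of that file-level parallelism, assuming one worker process per file; the names (process_file, the *.sam glob) are illustrative, and the per-line search logic is left as a placeholder. Each worker keeps its own local counters and returns them, so there is no shared state to synchronise:

```python
import glob
import multiprocessing

def process_file(path):
    """Run the existing single-file workflow; return this file's counters."""
    counters = [0, 0, 0]  # local counter_a, counter_b, counter_c
    with open(path) as f:
        for line in f:
            value = line.split()[3]
            # ... the existing search logic would update counters here ...
    return counters

if __name__ == "__main__":
    paths = glob.glob("*.sam")  # assumed naming scheme for the 30 files
    with multiprocessing.Pool() as pool:  # one worker per CPU by default
        per_file = pool.map(process_file, paths)
    # combine the per-file counters into grand totals
    totals = [sum(col) for col in zip(*per_file)]
```

Because the counters are summed only after all workers finish, this avoids the question's concern about updating global variables "simultaneously".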
If you really want to get a single file done faster, it depends on how your counters actually update. If almost all the work is in analysing the value, you can offload that with the multiprocessing module:
import time
import multiprocessing

def slowfunc(value):
    time.sleep(0.01)
    return value**2 + 0.3*value + 1

counter_a = counter_b = counter_c = 0

def add_to_counter(res):
    global counter_a, counter_b, counter_c
    counter_a += res
    counter_b -= (res - 10)**2
    counter_c += (int(res) % 2)

pool = multiprocessing.Pool(50)
results = []

for value in range(100000):
    r = pool.apply_async(slowfunc, [value])
    results.append(r)

    # don't let the queue grow too long
    if len(results) == 1000:
        results[0].wait()

    # drain any results that are already finished
    while results and results[0].ready():
        r = results.pop(0)
        add_to_counter(r.get())

for r in results:
    r.wait()
    add_to_counter(r.get())

print(counter_a, counter_b, counter_c)
That will allow 50 slowfuncs to run in parallel, so instead of taking 1000s (= 100k * 0.01s), it takes about 20s (= 100k/50 * 0.01s) to complete. If you can restructure your function into a "slowfunc" and an "add_to_counter" like the above, you should be able to get close to a factor-of-24 speedup.