Very large input and piping using subprocess.Popen
Question
I have a pretty simple problem. I have a large file that goes through three steps: a decoding step using an external program, some processing in Python, and then re-encoding using another external program. I have been using subprocess.Popen() to try to do this in Python rather than forming Unix pipes. However, all the data is buffered in memory. Is there a Pythonic way of doing this task, or am I better off dropping back to a simple Python script that reads from stdin and writes to stdout, with Unix pipes on either side?
import os, sys, subprocess

def main(infile,reflist):
    print infile,reflist
    samtoolsin = subprocess.Popen(["samtools","view",infile],
                                  stdout=subprocess.PIPE,bufsize=1)
    samtoolsout = subprocess.Popen(["samtools","import",reflist,"-",
                                    infile+".tmp"],stdin=subprocess.PIPE,bufsize=1)
    for line in samtoolsin.stdout.read():
        if(line.startswith("@")):
            samtoolsout.stdin.write(line)
        else:
            linesplit = line.split("\t")
            if(linesplit[10]=="*"):
                linesplit[9]="*"
            samtoolsout.stdin.write("\t".join(linesplit))
Recommended answer
Popen has a bufsize parameter that will limit the size of the buffer in memory. If you don't want the files in memory at all, you can pass file objects as the stdin and stdout parameters. From the subprocess docs:
bufsize, if given, has the same meaning as the corresponding argument to the built-in open() function: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size. A negative bufsize means to use the system default, which usually means fully buffered. The default value for bufsize is 0 (unbuffered).
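Beyond bufsize, one likely source of the memory use in the question's code is the call to samtoolsin.stdout.read(), which loads the decoder's entire output into a single string before the loop starts (and then iterates over it character by character rather than line by line). Iterating over the pipe object itself yields one line at a time. The following is a minimal sketch of that idea in Python 3, not a definitive implementation: the helper names fix_line and stream_through are hypothetical, and the length check and newline handling are assumptions added so that short or final fields do not break the comparison.

```python
import subprocess
import sys


def fix_line(line):
    """Apply the question's rule: header lines pass through unchanged;
    otherwise, if field 11 is '*', field 10 is set to '*' as well."""
    if line.startswith("@"):
        return line
    fields = line.rstrip("\n").split("\t")
    # Assumption: lines with fewer than 11 fields are passed through as-is.
    if len(fields) > 10 and fields[10] == "*":
        fields[9] = "*"
    return "\t".join(fields) + "\n"


def stream_through(decode_cmd, encode_cmd):
    """Pipe decode_cmd's stdout through fix_line into encode_cmd's stdin,
    holding only one line in Python at a time."""
    decoder = subprocess.Popen(decode_cmd, stdout=subprocess.PIPE, text=True)
    encoder = subprocess.Popen(encode_cmd, stdin=subprocess.PIPE, text=True)
    # Iterating the pipe reads line by line; .read() would load everything.
    for line in decoder.stdout:
        encoder.stdin.write(fix_line(line))
    encoder.stdin.close()   # signal EOF so the encoder can finish
    decoder.stdout.close()
    return decoder.wait(), encoder.wait()
```

With the commands from the question, this would be called as stream_through(["samtools", "view", infile], ["samtools", "import", reflist, "-", infile + ".tmp"]), and it returns the two exit codes so the caller can check that both programs succeeded.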