Grep with multi-threading
Question
I have the following (big) file with 30233088 strings:
head mystringfile.txt:
GAATGAACACGAAGAA
GAATGAACACGAAGAC
GAATGAACACGAAGAG
GAATGAACACGAAGCA
cat sequence.txt
AAATAGAGGGCGGTCCAGGCGTGTCGAAACACTGGGTCCAGGGCAAGAGCGGTTCGGGTGTCAGGAAAGCCCCCAAGGGGGTTCGCGCGGTTTGCAGTGAGGTAGAGGCCGGTGTATGGGTAGACAATTGGGGTCCCAAAGAAAAAGGCTCGTCCAACATCATAATAAACCCAAGCACGATAAAAAGCAAACGCAGACTTCAATAGGGTACGAGCAATTGTGGCAGGGTGCTCGCTGTCAGGGTTAGATCTTCTTGGAGTCGCGTCGCTCGGGGGGGCAAGGCCAACGTAAGATCGTGGCTGATCGCTGGCAATGCGGTCGGTTGGGTGGTCGCTAGTAGGGGCACGGCGGTCTCTTATGGCGTCGTAAAATGCGTCTCCAAAGCGAAAAGGGGCGGCAGACAAGTCACCGGGCAAGCTTAGAGGTCTGGGGCCCGTGGCTTTAGGGGAATGAACACGAAGACGCGAAACGAAGTCGTGTTTCTTGTTGGCTGTAGAGGGGAAAACCGTCTGGGGCGATCTGGCGTAGTAGTGCGTGTCTTGCAGTGAGCTCCCCGTCCGTAAGGATTCGCAGGAATCCTGCGTGAAGCTCGGTCGTCTCGGCCGTGTCTCGGGGTTTGATTGCGGGTTCAGATTGGAAAGGTCTCCTCGGGTCGTTTGCTGCATTTGCTCGCAACCCTGACGTGAAAGGGGTGAGCTGTCTCCAATCTGCCACGCTGGGTGTTGCGTCGTCAGTAAAAGACTTGGTCAAGCTGGGACCTCGCAAGATCGCGAGAGGGTTAAGCACAAAAGGTATGGCGAAGCTCCCGGGTGCTCTTGTGGCCACCCAGAATCATGGTGACGTAGGTTTTGCGAAGCCATCAAAAATTCAGGCGGCAAAACGAGCCAGTAGGGTCCTGGGCAGCTGGGCTTGTAGTGGGTAGGCGGCAAAACGCAAAGAATGAACACGAAGCAACTCCGTAGTGTGACGGGGGTTCTGACAAACGTCCTGCAAGAAGTTCGTCTTGGG
I need to grep each of these strings in another sequence file (sequence.txt, shown above) to determine the position of each match, which I do as follows:
while read line; do grep -b -o $line sequence.txt >>sequence.txt.count; done<mystringfile.txt
Running the code like this of course takes a very long time and only uses part of one thread, so how can I modify it (with parallel or xargs?) so that it runs on as many threads as I specify?
Recommended answer
Using a shell loop to process text is the wrong approach. Each of the 30233088 iterations over your input file starts a new grep process, re-reads sequence.txt, and reopens the output file for the redirection. Together these have a huge performance impact.
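The per-iteration redirection described above can be avoided by redirecting the whole loop once. The following is a minimal sketch on hypothetical sample data (the temporary directory and the two-line input are stand-ins, not the real files). It also quotes the variable and adds grep's -F flag so each line is treated as a fixed string, which is an assumption about the intent. It still starts one grep per input line, so it remains slow on 30 million lines.

```shell
# Hypothetical sample data standing in for the real files.
workdir=$(mktemp -d)
printf 'AAGAATGAACACGAAGACTTGAATGAACACGAAGCA\n' > "$workdir/sequence.txt"
printf 'GAATGAACACGAAGAC\nGAATGAACACGAAGCA\n' > "$workdir/mystringfile.txt"

# The output file is opened once for the whole loop, not once per line.
# -F treats each line as a fixed string; -- protects against lines starting with "-".
while IFS= read -r line; do
    grep -b -o -F -- "$line" "$workdir/sequence.txt"
done < "$workdir/mystringfile.txt" > "$workdir/sequence.txt.count"

loop_out=$(cat "$workdir/sequence.txt.count")
echo "$loop_out"
rm -rf "$workdir"
```

This removes the repeated open of the output file, but the cost of spawning a process per line is untouched, which is why a single-process tool is the better fix.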
Use the right tool for the job. Awk is your friend here. If sequence.txt is just one huge string, as you say, you can slurp it into a variable and search it with index(), which does a fixed-string (not regex) lookup, as below. Only the sequence itself is held in memory; the entries of mystringfile.txt are streamed line by line and never stored in RAM. Note that index() reports only the first occurrence of each string.
awk -v sequence="$(<sequence.txt)" 'n=index(sequence, $1){print n":"$1}' mystringfile.txt
This should be considerably faster than your current approach. To speed things up further, set your locale to C:
LC_ALL=C awk -v sequence="$(<sequence.txt)" 'n=index(sequence, $1){print n":"$1}' mystringfile.txt
To match the byte offset that grep's -b option prints, use n-1 instead of n in the command above, because index() is 1-based while grep's offsets are 0-based.
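The offset difference can be checked on a tiny example. This is a sketch on made-up sample data (a 20-character sequence and two candidate strings, both hypothetical). It uses $(cat file) rather than $(<file) so that it also runs in shells other than bash.

```shell
# Hypothetical sample data standing in for the real files.
tmpdir=$(mktemp -d)
printf 'AAGAATGAACACGAAGACTT\n' > "$tmpdir/sequence.txt"
printf 'GAATGAACACGAAGAC\nCCCCCCCCCCCCCCCC\n' > "$tmpdir/mystringfile.txt"

# grep -b reports a 0-based byte offset for the match.
grep_out=$(grep -b -o 'GAATGAACACGAAGAC' "$tmpdir/sequence.txt")

# awk's index() is 1-based, so subtract 1 to get the same number.
# Strings that do not occur give index() == 0 and are skipped.
awk_out=$(LC_ALL=C awk -v sequence="$(cat "$tmpdir/sequence.txt")" \
    'n = index(sequence, $1) { print (n - 1) ":" $1 }' "$tmpdir/mystringfile.txt")

echo "grep: $grep_out"
echo "awk:  $awk_out"
rm -rf "$tmpdir"
```

Both commands should report the same offset for the string that occurs in the sequence.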
If you still want to use GNU parallel, use --pipepart to split the file physically into parts and --block to set the size of each part (20M below, for chunks of about 20 MB):
parallel -a mystringfile.txt --pipepart --block=20M -q awk -v sequence="$(<sequence.txt)" 'n=index(sequence, $1){print n":"$1}'
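Where GNU parallel is not installed, the same chunk-and-process idea can be approximated with split and background jobs. This is not part of the answer above, only a rough sketch under stated assumptions: the sample data, the chunk size of 2 lines, and the chunk. file prefix are all hypothetical. One awk process is started per chunk, so the number of chunks determines how many jobs run at once.

```shell
# Hypothetical sample data standing in for the real files.
workdir=$(mktemp -d)
printf 'AAGAATGAACACGAAGACTTGAATGAACACGAAGCA\n' > "$workdir/sequence.txt"
printf '%s\n' GAATGAACACGAAGAA GAATGAACACGAAGAC GAATGAACACGAAGAG GAATGAACACGAAGCA \
    > "$workdir/mystringfile.txt"

sequence=$(cat "$workdir/sequence.txt")

# Split the string list into chunks of 2 lines each (chunk.aa, chunk.ab, ...).
split -l 2 "$workdir/mystringfile.txt" "$workdir/chunk."

# Start one background awk per chunk, each writing its own output file.
for chunk in "$workdir"/chunk.*; do
    LC_ALL=C awk -v sequence="$sequence" \
        'n = index(sequence, $1) { print n ":" $1 }' "$chunk" > "$chunk.out" &
done
wait  # block until every background awk has finished

# Concatenate the per-chunk results in chunk order.
parallel_out=$(cat "$workdir"/chunk.*.out)
echo "$parallel_out"
rm -rf "$workdir"
```

For the real file, the -l value would be chosen so that the number of chunks equals the number of jobs you want to run concurrently.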