Grep with multi-threading

Question

I have the following (big) file with 30233088 strings:

head mystringfile.txt:

GAATGAACACGAAGAA
GAATGAACACGAAGAC
GAATGAACACGAAGAG
GAATGAACACGAAGCA

cat sequence.txt

AAATAGAGGGCGGTCCAGGCGTGTCGAAACACTGGGTCCAGGGCAAGAGCGGTTCGGGTGTCAGGAAAGCCCCCAAGGGGGTTCGCGCGGTTTGCAGTGAGGTAGAGGCCGGTGTATGGGTAGACAATTGGGGTCCCAAAGAAAAAGGCTCGTCCAACATCATAATAAACCCAAGCACGATAAAAAGCAAACGCAGACTTCAATAGGGTACGAGCAATTGTGGCAGGGTGCTCGCTGTCAGGGTTAGATCTTCTTGGAGTCGCGTCGCTCGGGGGGGCAAGGCCAACGTAAGATCGTGGCTGATCGCTGGCAATGCGGTCGGTTGGGTGGTCGCTAGTAGGGGCACGGCGGTCTCTTATGGCGTCGTAAAATGCGTCTCCAAAGCGAAAAGGGGCGGCAGACAAGTCACCGGGCAAGCTTAGAGGTCTGGGGCCCGTGGCTTTAGGGGAATGAACACGAAGACGCGAAACGAAGTCGTGTTTCTTGTTGGCTGTAGAGGGGAAAACCGTCTGGGGCGATCTGGCGTAGTAGTGCGTGTCTTGCAGTGAGCTCCCCGTCCGTAAGGATTCGCAGGAATCCTGCGTGAAGCTCGGTCGTCTCGGCCGTGTCTCGGGGTTTGATTGCGGGTTCAGATTGGAAAGGTCTCCTCGGGTCGTTTGCTGCATTTGCTCGCAACCCTGACGTGAAAGGGGTGAGCTGTCTCCAATCTGCCACGCTGGGTGTTGCGTCGTCAGTAAAAGACTTGGTCAAGCTGGGACCTCGCAAGATCGCGAGAGGGTTAAGCACAAAAGGTATGGCGAAGCTCCCGGGTGCTCTTGTGGCCACCCAGAATCATGGTGACGTAGGTTTTGCGAAGCCATCAAAAATTCAGGCGGCAAAACGAGCCAGTAGGGTCCTGGGCAGCTGGGCTTGTAGTGGGTAGGCGGCAAAACGCAAAGAATGAACACGAAGCAACTCCGTAGTGTGACGGGGGTTCTGACAAACGTCCTGCAAGAAGTTCGTCTTGGG

I need to grep each of these strings in another file, sequence.txt, to determine the position of the match, which I do as follows:

while read line; do grep -b -o $line sequence.txt >>sequence.txt.count; done<mystringfile.txt

Running the code like this of course takes a very long time and uses only part of one thread. How can I modify it (with parallel or xargs?) so that it runs on as many threads as I specify?

Recommended answer

Processing text with a shell loop is the wrong approach. On each of the 30233088 iterations over your input file, the loop re-opens the output file for the redirection and starts a new grep process, and that per-line overhead has a huge performance impact.
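To illustrate the redirection point, here is a minimal sketch of the same loop with the output file opened once for the whole loop instead of once per line. The toy file contents and the temporary directory are assumptions made for the example, and GNU grep is assumed for the -b -o behaviour. It still starts one grep per input line, so it only removes part of the overhead.

```shell
# Toy stand-ins for the real files (hypothetical contents).
workdir=$(mktemp -d)
printf '%s\n' 'GAAT' 'TTTT' 'CGAA' > "$workdir/mystringfile.txt"
printf '%s\n' 'AAGAATCCGAAG' > "$workdir/sequence.txt"

# The redirection sits after "done", so the output file is opened once.
# -F treats each line as a fixed string; "--" protects against lines
# that start with a dash.
while IFS= read -r line; do
    grep -F -b -o -- "$line" "$workdir/sequence.txt"
done < "$workdir/mystringfile.txt" > "$workdir/sequence.txt.count"

loop_result=$(cat "$workdir/sequence.txt.count")
printf '%s\n' "$loop_result"
```

On the toy data this prints 2:GAAT and 7:CGAA; the string TTTT does not occur and produces no line.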

Use the right tool for the job; Awk is your friend here. If sequence.txt is just one huge string, as you say, you can slurp it into a variable and look up each string in it with index(), as below. This is a plain substring search, and it avoids the memory overhead of storing the 30233088 entries in RAM.

awk -v sequence="$(<sequence.txt)" 'n=index(sequence, $1){print n":"$1}' mystringfile.txt
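As a quick check of what the command prints, here is a small sketch on toy data. The file contents and temporary directory are made up for the example; $(cat file) is used as the portable equivalent of bash's $(<file), and the assignment is parenthesised only for readability.

```shell
# Toy stand-ins for the real files (hypothetical contents).
workdir=$(mktemp -d)
printf '%s\n' 'GAAT' 'TTTT' 'CGAA' > "$workdir/mystringfile.txt"
printf '%s\n' 'AAGAATCCGAAG' > "$workdir/sequence.txt"

# index() returns the 1-based position of the first occurrence of $1
# in sequence, or 0 when it is absent; a result of 0 is false, so
# lines without a match print nothing.
awk_result=$(awk -v sequence="$(cat "$workdir/sequence.txt")" \
    '(n = index(sequence, $1)) { print n ":" $1 }' \
    "$workdir/mystringfile.txt")

printf '%s\n' "$awk_result"
```

This prints 3:GAAT and 8:CGAA: the positions are 1-based, and TTTT is skipped because it does not occur in the sequence.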

This should be faster than your current approach. To speed things up further, set your locale to the C locale:

LC_ALL=C awk -v sequence="$(<sequence.txt)" 'n=index(sequence, $1){print n":"$1}' mystringfile.txt

To match grep's -b option, which prints the 0-based byte offset of the match, print n-1 instead of n in the command above.
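One difference worth keeping in mind: index() reports only the first occurrence, while grep -o prints every non-overlapping match. The sketch below compares the two on a toy sequence in which the string occurs twice, and shows one possible way to make awk report all occurrences. The loop variables (consumed, rest) and the file contents are assumptions for illustration, not part of the original answer; GNU grep is assumed.

```shell
# Toy data: GAAT occurs at byte offsets 0 and 5 (hypothetical contents).
workdir=$(mktemp -d)
printf '%s\n' 'GAATCGAAT' > "$workdir/sequence.txt"
printf '%s\n' 'GAAT' > "$workdir/mystringfile.txt"
seq_text=$(cat "$workdir/sequence.txt")

# grep -b -o prints the 0-based offset of every non-overlapping match.
grep_offsets=$(grep -F -b -o -- 'GAAT' "$workdir/sequence.txt")

# n-1 turns awk's 1-based position into a 0-based offset (first match only).
first_offset=$(awk -v sequence="$seq_text" \
    '(n = index(sequence, $1)) { printf "%d:%s\n", n - 1, $1 }' \
    "$workdir/mystringfile.txt")

# Repeat the search on the remainder to find later occurrences as well.
all_offsets=$(awk -v sequence="$seq_text" 'NF {
    consumed = 0; rest = sequence
    while ((n = index(rest, $1)) > 0) {
        printf "%d:%s\n", consumed + n - 1, $1
        consumed += n - 1 + length($1)      # skip past this match
        rest = substr(rest, n + length($1))
    }
}' "$workdir/mystringfile.txt")

printf '%s\n' "$grep_offsets" "$first_offset" "$all_offsets"
```

The repeated substr() call copies the remainder of the sequence on every match, which is acceptable for a sequence of this size but would be worth revisiting for much longer ones.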

If you still want to use GNU parallel, use --pipepart to hand each job its own part of the physical file, and --block to set roughly how many MB of the file each part contains:

parallel -a mystringfile.txt --pipepart --block=20M -q awk -v sequence="$(<sequence.txt)" 'n=index(sequence, $1){print n":"$1}'
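Since the question also mentions xargs, here is a rough sketch of the same idea without GNU parallel: split the string file into chunks and let xargs run one awk per chunk, a fixed number at a time. This is a swapped-in alternative, not the answer's method. The chunk size, the job count and the toy data are assumptions, and -P is an extension found in GNU and BSD xargs rather than a POSIX option.

```shell
# Toy stand-ins for the real files (hypothetical contents).
workdir=$(mktemp -d)
printf '%s\n' 'GAAT' 'TTTT' 'CGAA' 'AAGA' > "$workdir/mystringfile.txt"
printf '%s\n' 'AAGAATCCGAAG' > "$workdir/sequence.txt"
seq_text=$(cat "$workdir/sequence.txt")

# Split into chunks of 2 lines each; a real run would use a far larger -l.
split -l 2 "$workdir/mystringfile.txt" "$workdir/chunk."

jobs=2   # number of awk processes to run at the same time

# xargs appends one chunk path per awk invocation (-n 1) and keeps up to
# $jobs invocations running (-P). Output order depends on scheduling.
printf '%s\n' "$workdir"/chunk.* |
    xargs -P "$jobs" -n 1 awk -v sequence="$seq_text" \
        '(n = index(sequence, $1)) { print n ":" $1 }' \
    > "$workdir/matches.txt"

# Sort so the result is stable regardless of which job finished first.
xargs_result=$(LC_ALL=C sort "$workdir/matches.txt")
printf '%s\n' "$xargs_result"
```

Because the jobs write concurrently, the order of lines in matches.txt is not guaranteed; sort afterwards, or write one output file per chunk, if the order matters.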
