Faster grep function for big (27GB) files
Question
I have to grep a big file (27GB) for strings that also appear in a smaller file (5MB), and extract them along with other information from the matching lines. To speed up the analysis I split the 27GB file into 1GB files and then applied the following script (with the help of some people here). However, it is not very efficient: producing a 180KB output file takes 30 hours!
Here's the script. Is there a more appropriate tool than grep? Or a more efficient way to use grep?
#!/bin/bash
NR_CPUS=4
count=0
for z in `echo {a..z}` ;
do
for x in `echo {a..z}` ;
do
for y in `echo {a..z}` ;
do
for ids in $(cat input.sam|awk '{print $1}');
do
grep $ids sample_"$z""$x""$y"|awk '{print $1" "$10" "$11}' >> output.txt &
let count+=1
[[ $((count%NR_CPUS)) -eq 0 ]] && wait
done
done
done
done #&
A few things you can try:
1) You are reading input.sam multiple times. It only needs to be read once, before your first loop starts. Save the ids to a temporary file which will be read by grep.
2) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8. This will speed up grep.
3) Use fgrep, because you're searching for a fixed string, not a regular expression.
4) Use -f to make grep read patterns from a file, rather than using a loop.
5) Don't write to the output file from multiple processes, as you may end up with lines interleaving and a corrupt file.
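The last point can be illustrated with a toy sketch (the chunk_a/chunk_b file names are made up for the example): each background job writes to a private file, and the pieces are merged only once wait confirms every job has finished, so output.txt only ever has a single writer.

```shell
# Toy illustration of point 5 (hypothetical file names): each background
# job writes to its own file, and the pieces are merged only after `wait`,
# so no two processes ever write to output.txt concurrently.
for f in chunk_a chunk_b; do
    printf 'result from %s\n' "$f" > "out_$f" &   # private output per job
done
wait                                              # all jobs have finished
cat out_chunk_a out_chunk_b > output.txt          # single-writer merge
rm -f out_chunk_a out_chunk_b
```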
After making those changes, this is what your script would become:
awk '{print $1}' input.sam > idsFile.txt
for z in {a..z}
do
for x in {a..z}
do
for y in {a..z}
do
LC_ALL=C fgrep -f idsFile.txt sample_"$z""$x""$y" | awk '{print $1,$10,$11}'
done >> output.txt
done
done
Also, check out GNU Parallel, which is designed to help you run jobs in parallel.