Faster grep function for big (27GB) files


Problem Description


I have a small file (5MB) containing specific strings, and I need to grep a big file (27GB) for those same strings (and other information). To speed up the analysis I split the 27GB file into 1GB files and then applied the following script (written with the help of some people here). However, it is not very efficient: producing a 180KB output file takes 30 hours!

Here's the script. Is there a more appropriate tool than grep? Or a more efficient way to use grep?

#!/bin/bash

NR_CPUS=4
count=0


for z in `echo {a..z}` ;
do
 for x in `echo {a..z}` ;
 do
  for y in `echo {a..z}` ;
  do
   for ids in $(cat input.sam|awk '{print $1}');  
   do 
    grep $ids sample_"$z""$x""$y"|awk '{print $1" "$10" "$11}' >> output.txt &
    let count+=1
                                [[ $((count%NR_CPUS)) -eq 0 ]] && wait
   done
  done
 done
done

Solution

A few things you can try:

1) You are reading input.sam multiple times. It only needs to be read once before your first loop starts. Save the ids to a temporary file which will be read by grep.

2) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8. This will speed up grep.

3) Use fgrep because you're searching for a fixed string, not a regular expression.

4) Use -f to make grep read patterns from a file, rather than using a loop.

5) Don't write to the output file from multiple processes as you may end up with lines interleaving and a corrupt file.
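Suggestions 1–4 combine into a single grep invocation. Here is a minimal toy run (the file contents below are stand-ins for the real data, and `grep -F` is the modern spelling of `fgrep`):

```shell
# Toy stand-ins for the real files; the contents are illustrative only.
printf 'id1 x\nid3 y\n' > input.sam
printf 'id1 a b\nid2 c d\nid3 e f\n' > sample_aaa

# Read input.sam once, saving the IDs to a file for grep (suggestion 1).
awk '{print $1}' input.sam > idsFile.txt

# C locale (2), fixed strings (3), and patterns read from a file (4)
# replace the per-ID loop with one pass over the chunk.
LC_ALL=C grep -F -f idsFile.txt sample_aaa
```

With `-f idsFile.txt`, grep builds its matcher once and scans each chunk a single time, instead of re-scanning the chunk for every ID.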

After making those changes, this is what your script would become:

awk '{print $1}' input.sam > idsFile.txt
for z in {a..z}
do
 for x in {a..z}
 do
  for y in {a..z}
  do
    LC_ALL=C fgrep -f idsFile.txt sample_"$z""$x""$y" | awk '{print $1,$10,$11}'
  done >> output.txt
 done
done

Also, check out GNU Parallel which is designed to help you run jobs in parallel.
