Faster grep function for big (27GB) files
Question
I have to grep a big file (27GB) for strings that also appear in a smaller file (5MB), and extract them along with other information from the matching lines. To speed up the analysis I split the 27GB file into 1GB files and then applied the following script (with the help of some people here). However, it is not very efficient: producing a 180KB output file takes 30 hours!
Here's the script. Is there a more appropriate tool than grep? Or a more efficient way to use grep?
#!/bin/bash
NR_CPUS=4
count=0
for z in `echo {a..z}` ;
do
for x in `echo {a..z}` ;
do
for y in `echo {a..z}` ;
do
for ids in $(cat input.sam|awk '{print $1}');
do
grep $ids sample_"$z""$x""$y"|awk '{print $1" "$10" "$11}' >> output.txt &
let count+=1
[[ $((count%NR_CPUS)) -eq 0 ]] && wait
done
done
done
done #&
A few things you can try:
1) You are reading input.sam multiple times. It only needs to be read once, before your first loop starts. Save the ids to a temporary file which will be read by grep.
2) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8. This will speed up grep.
3) Use fgrep, because you're searching for a fixed string, not a regular expression.
4) Use -f to make grep read patterns from a file, rather than using a loop.
5) Don't write to the output file from multiple processes, as you may end up with lines interleaving and a corrupt file.
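The last point can be illustrated with a toy sketch (the chunk_a/chunk_b file names are made up for the example): each background job writes to a private file, and the pieces are merged only once wait confirms every job has finished, so output.txt only ever has a single writer.

```shell
# Toy illustration of point 5 (hypothetical file names): each background
# job writes to its own file, and the pieces are merged only after `wait`,
# so no two processes ever write to output.txt concurrently.
for f in chunk_a chunk_b; do
    printf 'result from %s\n' "$f" > "out_$f" &   # private output per job
done
wait                                              # all jobs have finished
cat out_chunk_a out_chunk_b > output.txt          # single-writer merge
rm -f out_chunk_a out_chunk_b
```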
After making those changes, this is what your script would become:
awk '{print $1}' input.sam > idsFile.txt
for z in {a..z}
do
for x in {a..z}
do
for y in {a..z}
do
LC_ALL=C fgrep -f idsFile.txt sample_"$z""$x""$y" | awk '{print $1,$10,$11}'
done >> output.txt
done
done
Also, check out GNU Parallel, which is designed to help you run jobs in parallel.