Quickest way to select/copy lines containing a string from a huge txt.gz file

Views: 103 | Published: 2021/5/9 20:46:25 | Tags: linux ubuntu awk sed grep

Question

So I have the following sed one liner:

sed -e '/^S|/d' -e '/^T|/d' -e '/^#D=/d' -e '/^##/d' -e 's/H|/,H|/g' -e 's/Q|/,,Q|/g' -e '1 i\,,,' sample_1.txt > sample_2.txt

I have many lines that start with either:

S|
T|
#D=
##
H|
Q|

The idea is to skip the lines starting with one of the first four prefixes, and to replace H| (at the beginning of a line) with ,H| and Q| (at the beginning of a line) with ,,Q|.
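To make the intended transformation concrete, here is a minimal sketch on a made-up six-line sample. The file contents are hypothetical. The substitutions are anchored with ^ to match the description ("at the beginning of lines"), which differs slightly from the unanchored one-liner, and the `1 i\,,,` header insertion is left out to keep the focus on the filtering.

```shell
# Hypothetical sample input with one line of each type.
printf '%s\n' '##header' '#D=2021' 'S|a' 'T|b' 'H|c' 'Q|d' > sample_1.txt

# Delete the four unwanted line types, then prefix H| and Q| lines.
# In sed's basic regex syntax an unescaped | is a literal pipe character.
sed -e '/^S|/d' -e '/^T|/d' -e '/^#D=/d' -e '/^##/d' \
    -e 's/^H|/,H|/' -e 's/^Q|/,,Q|/' sample_1.txt > sample_2.txt

cat sample_2.txt
```

Only the H| and Q| lines survive, each with its comma prefix.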

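But now I need to:

do this in the fastest way possible (the internet suggests that (m)awk is faster than sed)
read from a .txt.gz file and save the result to a .txt.gz file, avoiding an intermediate unzip/re-zip if possible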

There are in fact several hundred .txt.gz files, each about 1 GB, to process in this way (all in the same folder). Is there a CLI way to run the code in parallel on all of them (so that each core is assigned a subset of the files in the directory)?

I use Linux (Ubuntu).

Recommended answer

Untested, but likely pretty close to this with GNU Parallel.

First make output directory so as not to overwrite any valuable data:

mkdir -p output

Now declare a function that does one file and export it to subprocesses so jobs started by GNU Parallel can find it:

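doit(){
    echo "Processing $1"
    # gzip -dc decompresses to stdout; gzcat is the macOS/BSD name and is typically absent on Ubuntu
    gzip -dc "$1" | awk '
        /^[ST]\|/ || /^#D=/ || /^##/ {next}    # ignore lines starting S|, T| etc
        /^H\|/ {printf ","}                    # prefix "H|" with "," (printf adds no newline)
        /^Q\|/ {printf ",,"}                   # prefix "Q|" with ",,"
        1                                      # print the line, after any prefix
    ' | gzip > "output/$1"
}
export -f doit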
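Since the answer is marked as untested, it may be worth trying the decompress, filter, recompress pipeline on one tiny file before running it on hundreds of 1 GB files. The sketch below uses a made-up five-line sample, and the file names sample.txt.gz and sample.out.gz are hypothetical. It emits the prefix with `printf` so that the comma lands on the same line as the record rather than on a line of its own.

```shell
# Build a tiny, hypothetical compressed sample.
printf '%s\n' '##meta' 'S|1' 'H|2' 'Q|3' 'X|4' | gzip > sample.txt.gz

# Stream: decompress to stdout, filter, recompress. No uncompressed file is written.
gzip -dc sample.txt.gz | awk '
    /^[ST]\|/ || /^#D=/ || /^##/ {next}    # drop unwanted lines
    /^H\|/ {printf ","}                    # prefix only, no newline
    /^Q\|/ {printf ",,"}
    1                                      # print the line itself
' | gzip > sample.out.gz

# Inspect the result, again without unpacking to disk.
gzip -dc sample.out.gz
```

The ## and S| lines are dropped, the H| and Q| lines gain their prefixes, and the X|4 line passes through unchanged.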

Now process all txt.gz files in parallel and show progress bar too:

parallel --bar doit ::: *txt.gz
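If GNU Parallel is not installed, a similar fan-out can be sketched with `xargs -P` from GNU findutils. This is a swapped-in alternative, not the method the answer uses, and it has no progress bar. The demo directory, the two sample files, and the filter.awk file name are all hypothetical. The awk program is kept in its own file to avoid nested quoting inside `sh -c`.

```shell
# Two tiny, hypothetical inputs in a scratch directory named demo.
mkdir -p demo/output
printf '%s\n' 'S|1' 'H|2' | gzip > demo/a.txt.gz
printf '%s\n' '##x' 'Q|3' | gzip > demo/b.txt.gz

# The filter, stored in a file so the xargs command stays readable.
cat > demo/filter.awk <<'EOF'
/^[ST]\|/ || /^#D=/ || /^##/ { next }   # drop unwanted lines
/^H\|/ { printf "," }                   # prefix only, no newline
/^Q\|/ { printf ",," }
1                                       # print the line itself
EOF

(
    cd demo || exit 1
    # One job per file, with at most $(nproc) jobs running at a time.
    printf '%s\0' *.txt.gz |
        xargs -0 -n 1 -P "$(nproc)" sh -c \
            'gzip -dc "$1" | awk -f filter.awk | gzip > "output/$1"' _
)
```

The NUL-delimited file list (`printf '%s\0'` with `xargs -0`) keeps file names with spaces intact. `nproc` reports the number of available cores, so each core handles one file at a time until the list is exhausted.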
