Quickest way to select/copy lines containing a string from a huge txt.gz file
Question
So I have the following sed one-liner:
sed -e '/^S|/d' -e '/^T|/d' -e '/^#D=/d' -e '/^##/d' -e 's/H|/,H|/g' -e 's/Q|/,,Q|/g' -e '1 i\,,,' sample_1.txt > sample_2.txt
I have many lines that start with either:
S|
T|
#D=
##
H|
Q|
The idea is to not copy the lines starting with one of the first four, and to replace H| (at the beginning of lines) by ,H| and Q| (at the beginning of lines) by ,,Q|
But now I need to:
- do this the fastest way possible (the internet suggests (m)awk is faster than sed)
- read from a .txt.gz file and save the result to a .txt.gz file, avoiding an intermediate unzip/re-zip if possible
- there are in fact several hundred .txt.gz files, each about 1GB, to process in this way (all in the same folder). Is there a CLI way to run the code in parallel on all of them (so each core gets assigned a subset of the files in the directory)?
- I use Linux (Ubuntu)
Recommended answer
Untested, but likely pretty close to this with GNU Parallel.
First make an output directory so as not to overwrite any valuable data:
mkdir -p output
Now declare a function that processes one file, and export it to subprocesses so that jobs started by GNU Parallel can find it:
doit(){
   echo "Processing $1"
   # gzip -dc decompresses to stdout, like gzcat/zcat, and exists wherever gzip does
   gzip -dc "$1" | awk '
      /^[ST]\|/ || /^#D=/ || /^##/ {next}   # ignore lines starting S|, T| etc
      /^H\|/ {print "," $0; next}           # prefix "H|" with ","
      /^Q\|/ {print ",," $0; next}          # prefix "Q|" with ",,"
      1                                     # print all other lines
   ' | gzip > output/"$1"
}
export -f doit
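Before letting this loose on hundreds of 1GB files, it may be worth checking the filter on a tiny sample. The sketch below is only that: the file names sample.txt.gz and sample_out.txt.gz and the helper name filter_lines are made up for illustration, and it assumes the intended result is what the question describes (drop the S|, T|, #D= and ## lines, prefix H| lines with "," and Q| lines with ",,"). Like the function above, it leaves out the ",,," header line from the sed one-liner.

```shell
# Hypothetical helper: filter stdin to stdout, as described in the question.
filter_lines() {
  awk '
    /^[ST]\|/ || /^#D=/ || /^##/ { next }   # drop S|, T|, #D= and ## lines
    /^H\|/ { print "," $0; next }           # H| -> ,H|
    /^Q\|/ { print ",," $0; next }          # Q| -> ,,Q|
    { print }                               # keep everything else as-is
  '
}

# Build a tiny gzipped sample (file name is made up for this sketch).
printf '%s\n' '##header' '#D=1' 'S|a' 'T|b' 'H|c' 'Q|d' 'X|e' | gzip > sample.txt.gz

# Stream: decompress -> filter -> recompress, with no intermediate file on disk.
gzip -dc sample.txt.gz | filter_lines | gzip > sample_out.txt.gz

# Inspect the result; this should show ,H|c then ,,Q|d then X|e
gzip -dc sample_out.txt.gz
```

If the three lines shown are what you expect, the same pipeline should behave the same way on the real files, just more slowly.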
Now process all the txt.gz files in parallel, showing a progress bar too:
parallel --bar doit ::: *txt.gz
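If GNU Parallel is not installed, something similar can probably be done with xargs -P, which also runs several jobs at once. This is a rough sketch rather than a tested replacement: it assumes GNU xargs and nproc (both normally present on Ubuntu), the file name filter.awk is made up, and the two demo_*.txt.gz files exist only so the sketch can be run as-is. There is no progress bar in this variant.

```shell
# Two tiny demo inputs so the sketch runs as-is; skip these with real data.
printf '%s\n' '##x' 'H|1' 'S|2' | gzip > demo_a.txt.gz
printf '%s\n' 'Q|3' 'T|4' 'Z|5' | gzip > demo_b.txt.gz

# Keep the awk program in a file so the worker command stays short
# (filter.awk is a made-up name for this sketch).
cat > filter.awk <<'EOF'
/^[ST]\|/ || /^#D=/ || /^##/ { next }   # drop S|, T|, #D= and ## lines
/^H\|/ { print "," $0; next }           # H| -> ,H|
/^Q\|/ { print ",," $0; next }          # Q| -> ,,Q|
{ print }                               # keep everything else as-is
EOF

mkdir -p output

# One file per job (-n 1), as many jobs at once as there are cores (-P).
# NUL separators (-0) keep file names with spaces intact.
printf '%s\0' *.txt.gz |
  xargs -0 -n 1 -P "$(nproc)" sh -c \
    'gzip -dc "$1" | awk -f filter.awk | gzip > "output/$1"' _
```

Since gzip itself uses a fair amount of CPU, it may be worth trying a lower -P value and timing a handful of files before committing to the full run.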