Why is grep so slow and memory-intensive with the -w (--word-regexp) flag?
Question
I have a list of ids in one file and a data file (~3.2 GB in size), and I want to extract each line in the data file that contains an id, together with the line that follows it. I did the following:
grep -A1 -Ff file.ids file.data | grep -v "^-" > output.data
This worked, but it also matched unwanted substrings: for example, if the id is EA4, it also pulled out the lines containing EA40.
So I tried the same command with the -w (--word-regexp) flag added to the first grep to match whole words. However, the command now ran for more than an hour (rather than ~26 seconds) and started using tens of gigabytes of memory, so I had to kill the job.
Why did adding -w make the command so slow and memory-hungry? How can I run this command efficiently to get my desired output? Thank you.
file.ids looks like this:
>EA4
>EA9
file.data looks like this:
>EA4 text
data
>E40 blah
more_data
>EA9 text_again
data_here
output.data looks like this:
>EA4 text
data
>EA9 text_again
data_here
Recommended answer
grep -F string file simply looks for occurrences of string in the file, but grep -w -F string file also has to check the characters before and after each occurrence of string to see whether they are word characters. That is a lot of extra work. One possible implementation would first split each line into every possible non-word-character-delimited string (with overlaps, of course), which could take up a lot of memory, but I don't know whether that is what is causing your memory usage.
In any case, grep is simply the wrong tool for this job: you only want to match against a specific field in the input file, so you should be using awk instead:
$ awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f' file.ids file.data
>EA4 text
data
>EA9 text_again
data_here
The above assumes your data lines cannot start with >. If they can, then tell us how to distinguish data lines from id lines.
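For readers less familiar with awk, the same one-liner can be spelled out with comments. This is only an annotated sketch: it recreates the question's sample files in a scratch directory (the tmp and result variables are illustrative) and relies on the same assumption that only id lines start with >.

```shell
# Recreate the sample inputs from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '>EA4' '>EA9' > "$tmp/file.ids"
printf '%s\n' '>EA4 text' 'data' '>E40 blah' 'more_data' \
              '>EA9 text_again' 'data_here' > "$tmp/file.data"

result=$(awk '
    # First file (file.ids): store every id as an array key, then skip to the next line.
    NR == FNR { ids[$0]; next }

    # Second file (file.data): on each id line, set the flag f to 1 if the first
    # whitespace-separated field is exactly one of the stored ids, otherwise 0.
    /^>/ { f = ($1 in ids) }

    # Print the current line while the flag is set (the id line and its data lines).
    f
' "$tmp/file.ids" "$tmp/file.data")

printf '%s\n' "$result"
rm -r "$tmp"
```

Because the first field is compared as a whole against the array keys, >EA4 can never match >EA40, and each line of the data file costs one array lookup at most.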
Note that the above will work no matter how many data lines you have between id lines, even if there are 0 or 100:
$ cat file.data
>EA4 text
>E40 blah
more_data
>EA9 text_again
data 1
data 2
data 3
$ awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f' file.ids file.data
>EA4 text
>EA9 text_again
data 1
data 2
data 3
Also, you don't need to pipe the output to grep -v:
grep -A1 -Ff file.ids file.data | grep -v "^-" > output.data
Just do it all in the one script:
awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f && !/^-/' file.ids file.data