Why is grep so slow and memory intensive with -w (--word-regexp) flag?


Question

I have a list of ids in a file and a data file (of ~3.2Gb in size), and I want to extract the lines in the data file that contain the id and also the next line. I did the following:

grep -A1 -Ff file.ids file.data | grep -v "^-" > output.data

This worked, but also extracted unwanted substrings, for example if the id is EA4 it also pulled out the lines with EA40.
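The substring behavior is easy to reproduce on a small scale (the file names and contents here are illustrative, not the original 3.2 Gb data):

```shell
# Without -w, grep -F matches the id anywhere in a line, so >EA4 also hits >EA40
printf '>EA4 text\ndata\n>EA40 blah\nmore\n' > demo.data
printf '>EA4\n' > demo.ids

grep -cFf demo.ids demo.data    # counts 2 lines: ">EA4 text" and ">EA40 blah"
grep -cwFf demo.ids demo.data   # counts 1 line: -w rejects ">EA40 blah"
```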

So I tried using the same command but adding the -w (--word-regexp) flag to the first grep to match whole words. However, I found my command now ran for >1 hour (rather than ~26 seconds) and also started using 10s of gigabytes of memory, so I had to kill the job.

Why did adding -w make the command so slow and memory grabbing? How can I efficiently run this command to get my desired output? Thank you

file.ids looks like this:

>EA4
>EA9

file.data looks like this:

>EA4 text
data
>E40 blah
more_data
>EA9 text_again
data_here

output.data should look like this:

>EA4 text
data
>EA9 text_again
data_here

Answer

grep -F string file simply looks for occurrences of string in the file, but grep -w -F string file also has to check the characters immediately before and after each match to see whether they are word characters. That's a lot of extra work. One possible implementation would first split each line into every possible non-word-character-delimited string (with overlaps, of course), which could take up a lot of memory, though it's unclear whether that is what's causing your memory usage.
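The boundary check that -w adds can be seen in isolation (a minimal illustration):

```shell
# -w accepts a match only when it is not flanked by word characters
# (letters, digits, underscore)
echo 'EA40'  | grep -w 'EA4'    # no output: the trailing "0" is a word character
echo 'EA4,x' | grep -w 'EA4'    # prints "EA4,x": "," is not a word character
```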

In any case, grep is simply the wrong tool for this job since you only want to match against a specific field in the input file; you should be using awk instead:

$ awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f' file.ids file.data
>EA4 text
data
>EA9 text_again
data_here
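Written out with comments, the one-liner reads as follows (identical logic, just expanded):

```shell
awk '
  NR==FNR { ids[$0]; next }    # 1st file (file.ids): store each line as an array key
  /^>/    { f = ($1 in ids) }  # id line in file.data: flag on iff field 1 is a known id
  f                            # while flagged, print the id line and the lines after it
' file.ids file.data
```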

The above assumes your "data" lines cannot start with >. If they can, then tell us how to identify data lines vs. id lines.

Note that the above will work no matter how many data lines you have between id lines, even if there are 0 or 100:

$ cat file.data
>EA4 text
>E40 blah
more_data
>EA9 text_again
data 1
data 2
data 3

$ awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f' file.ids file.data
>EA4 text
>EA9 text_again
data 1
data 2
data 3

Also, you don't need to pipe the output to grep -v:

grep -A1 -Ff file.ids file.data | grep -v "^-" > output.data

just do it all in the one script:

awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f && !/^-/' file.ids file.data
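For instance, if a data line did start with - (the sample files below are illustrative), the combined script both selects the id blocks and drops that line, with no second grep needed:

```shell
printf '>EA4\n' > t.ids
printf '>EA4 text\n-dashed_line\ndata\n' > t.data

# f && !/^-/ prints flagged lines except those starting with "-"
awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f && !/^-/' t.ids t.data
# prints:
# >EA4 text
# data
```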
