Check if all of multiple strings or regexes exist in a file


Problem description

I want to check if all of my strings exist in a text file. They could exist on the same line or on different lines. And partial matches should be OK. Like this:

...
string1
...
string2
...
string3
...
string1 string2
...
string1 string2 string3
...
string3 string1 string2
...
string2 string3
... and so on

In the above example, we could have regexes in place of strings.

For example, the following code checks if any of my strings exists in the file:

if grep -Eq "string1|string2|string3" file; then
  # there is at least one match
fi

How can I check if all of them exist? Since we are only interested in the presence of all matches, we should stop reading the file as soon as all strings are matched.

Is it possible to do it without having to invoke grep multiple times (which won't scale when the input file is large or when we have a large number of strings to match) or use a tool like awk or python?
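
For reference, the multiple-grep approach mentioned above could look roughly like the sketch below. The function name all_present_naive is made up for illustration. It runs one grep per string, so in the worst case the file is read once for every string, which is exactly the scaling problem described.

```shell
# Baseline sketch: one grep call per search string.
# all_present_naive is a hypothetical helper name.
# Usage: all_present_naive FILE STRING...
all_present_naive() {
    file=$1
    shift
    for s in "$@"; do
        # -F: fixed string, -q: quiet (stops at the first match),
        # -e: protects strings that start with "-".
        grep -Fq -e "$s" "$file" || return 1
    done
    return 0
}
```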

Also, is there a solution for strings that can easily be extended for regexes?

Solution

Awk is the tool that the people who invented grep, the shell, etc. created for general text manipulation jobs like this, so I'm not sure why you'd want to avoid it.

In case brevity is what you're looking for, here's the GNU awk one-liner to do just what you asked for:

awk 'NR==FNR{a[$0];next} {for(s in a) if(!index($0,s)) exit 1}' strings RS='^$' file

And here's a bunch of other information and options:

Assuming you're really looking for strings, it'd be:

awk -v strings='string1 string2 string3' '
BEGIN {
    numStrings = split(strings,tmp)
    for (i in tmp) strs[tmp[i]]
}
numStrings == 0 { exit }
{
    for (str in strs) {
        if ( index($0,str) ) {
            delete strs[str]
            numStrings--
        }
    }
}
END { exit (numStrings ? 1 : 0) }
' file

The above will stop reading the file as soon as all strings have matched.
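
Because the script reports its result only through the exit status, it can be used directly as a shell condition. A minimal sketch, assuming a wrapper function named all_strings_present (a hypothetical name, not part of the answer) and search strings that contain no whitespace or backslashes:

```shell
# all_strings_present is a hypothetical wrapper around the script above.
# Usage: all_strings_present FILE STRING...
all_strings_present() {
    file=$1
    shift
    # "$*" joins the remaining arguments with spaces, so the strings
    # themselves must not contain whitespace. awk also interprets
    # backslash escapes in -v values, so avoid backslashes too.
    awk -v strings="$*" '
    BEGIN {
        numStrings = split(strings, tmp)
        for (i in tmp) strs[tmp[i]]
    }
    numStrings == 0 { exit }      # nothing left to find: stop reading
    {
        for (str in strs) {
            if (index($0, str)) {
                delete strs[str]
                numStrings--
            }
        }
    }
    END { exit (numStrings ? 1 : 0) }
    ' "$file"
}
```

It could then be called as `if all_strings_present file string1 string2 string3; then echo "all found"; fi`.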

If you were looking for regexps instead of strings then with GNU awk for multi-char RS and retention of $0 in the END section you could do:

awk -v RS='^$' 'END{exit !(/regexp1/ && /regexp2/ && /regexp3/)}' file

Actually, even if it were strings you could do:

awk -v RS='^$' 'END{exit !(index($0,"string1") && index($0,"string2") && index($0,"string3"))}' file

The main issue with the above two GNU awk solutions is that, like @anubhava's GNU grep -P solution, they have to read the whole file into memory at once. The first awk script above, by contrast, works in any awk in any shell on any UNIX box and only stores one line of input at a time.

I see you've added a comment under your question to say you could have several thousand "patterns". Assuming you mean "strings" then instead of passing them as arguments to the script you could read them from a file, e.g. with GNU awk for multi-char RS and a file with one search string per line:

awk '
NR==FNR { strings[$0]; next }
{
    for (string in strings)
        if ( !index($0,string) )
            exit 1
}
' file_of_strings RS='^$' file_to_be_searched

and for regexps it'd be:

awk '
NR==FNR { regexps[$0]; next }
{
    for (regexp in regexps)
        if ( $0 !~ regexp )
            exit 1
}
' file_of_regexps RS='^$' file_to_be_searched

If you don't have GNU awk and your input file does not contain NUL characters, then you can get the same effect as above by using RS='\0' instead of RS='^$', or by appending each line to a variable as it's read and then processing that variable in the END section.
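
The append-to-a-variable alternative might be sketched as follows. The function name all_strings_slurp and the variable name buf are illustrative, and the sketch assumes file_of_strings is not empty. Like the RS='^$' versions, it still holds the whole searched file in memory.

```shell
# Portable sketch: no multi-char RS needed.
# all_strings_slurp and buf are illustrative names.
# $1 = file with one search string per line, $2 = file to search
all_strings_slurp() {
    awk '
    NR == FNR { strings[$0]; next }   # first file: collect the strings
    { buf = buf $0 "\n" }             # second file: append each line
    END {
        # Test every string once against the rebuilt input.
        for (string in strings)
            if (!index(buf, string))
                exit 1
    }
    ' "$1" "$2"
}
```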

If your file_to_be_searched is too large to fit in memory then it'd be this for strings:

awk '
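# Count each distinct string once so duplicate lines do not inflate numStrings.
NR==FNR { if (!($0 in strings)) { strings[$0]; numStrings++ } next }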
numStrings == 0 { exit }
{
    for (string in strings) {
        if ( index($0,string) ) {
            delete strings[string]
            numStrings--
        }
    }
}
END { exit (numStrings ? 1 : 0) }
' file_of_strings file_to_be_searched

and the equivalent for regexps:

awk '
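# Count each distinct regexp once so duplicate lines do not inflate numRegexps.
NR==FNR { if (!($0 in regexps)) { regexps[$0]; numRegexps++ } next }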
numRegexps == 0 { exit }
{
    for (regexp in regexps) {
        if ( $0 ~ regexp ) {
            delete regexps[regexp]
            numRegexps--
        }
    }
}
END { exit (numRegexps ? 1 : 0) }
' file_of_regexps file_to_be_searched
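
When a check fails, it can help to know which patterns were never found. The following is an optional variation on the script above, not something from the original answer: the function name report_missing_regexps and the "missing:" output format are assumptions made for illustration. Whatever is left in the array at END never matched any line.

```shell
# report_missing_regexps is a hypothetical helper name.
# $1 = file with one regexp per line, $2 = file to search
report_missing_regexps() {
    awk '
    # Count each distinct regexp once.
    NR == FNR { if (!($0 in regexps)) { regexps[$0]; numRegexps++ } next }
    numRegexps == 0 { exit }          # everything matched: stop reading
    {
        for (regexp in regexps) {
            if ($0 ~ regexp) {
                delete regexps[regexp]
                numRegexps--
            }
        }
    }
    END {
        # Entries still in the array never matched any line.
        for (regexp in regexps) print "missing: " regexp
        exit (numRegexps ? 1 : 0)
    }
    ' "$1" "$2"
}
```

The exit status is the same as before, so the function can still be used in an `if`, with the list of unmatched regexps printed on standard output.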
