Check if all of multiple strings or regexes exist in a file

Question

I want to check whether all of my strings exist in a text file. They may appear on the same line or on different lines, and partial matches are fine. For example:

...
string1
...
string2
...
string3
...
string1 string2
...
string1 string2 string3
...
string3 string1 string2
...
string2 string3
... and so on

In the above example, we could have regexes in place of strings.

For example, the following code checks whether any of my strings exists in the file:

if grep -Eq "string1|string2|string3" file; then
  # there is at least one match
fi

How can I check whether all of them exist? Since we are only interested in the presence of all matches, we should stop reading the file as soon as all strings have matched.

Is it possible to do this without having to invoke grep multiple times (which won't scale when the input file is large or when we have a large number of strings to match) or use a tool like awk or python?
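For reference, the multiple-grep approach mentioned above might look like the sketch below. The helper name all_strings_present is made up for illustration, and the demo file is a throwaway. It works, but it rereads the file once per string, which is the scaling problem in question.

```shell
#!/bin/sh
# Baseline sketch: one grep call per search string.
# all_strings_present is a hypothetical helper name, not from the original post.
# $1: file to search; remaining arguments: literal strings to look for.
all_strings_present() {
    file=$1
    shift
    for s in "$@"; do
        # -F: treat the pattern as a fixed string; -q: quiet, stop at first match;
        # --: protects strings that start with a dash
        grep -Fq -- "$s" "$file" || return 1
    done
    return 0
}

# Demo on a temporary file
demo=$(mktemp)
printf '%s\n' 'foo string1 bar' 'string3 string2' > "$demo"
if all_strings_present "$demo" string1 string2 string3; then
    echo "all present"
else
    echo "some missing"
fi
rm -f "$demo"
```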

Also, is there a solution for strings that can easily be extended to regexes?

Recommended answer

Awk comes from the same Bell Labs Unix group that produced grep, the shell, etc., and it was designed for general text manipulation jobs like this, so I'm not sure why you'd want to avoid it.

In case brevity is what you're looking for, here's a GNU awk one-liner that does just what you asked for:

awk 'NR==FNR{a[$0];next} {for(s in a) if(!index($0,s)) exit 1}' strings RS='^$' file

And here's a bunch of other information and options:

Assuming you're really looking for strings, it'd be:

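awk -v strings='string1 string2 string3' '
BEGIN {
    # split the space-separated list and count each distinct string once
    n = split(strings,tmp)
    for (i=1; i<=n; i++) {
        if ( !(tmp[i] in strs) ) {
            strs[tmp[i]]
            numStrings++
        }
    }
}
numStrings == 0 { exit }    # every string has been seen: stop reading
{
    for (str in strs) {
        if ( index($0,str) ) {
            delete strs[str]    # found, so stop looking for it
            numStrings--
        }
    }
}
END { exit (numStrings ? 1 : 0) }    # status 0 only if nothing is left unmatched
' file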

The above stops reading the file as soon as all strings have matched.
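To make the early exit visible, here is a small sketch that wraps the same logic in a shell function and prints how many lines awk consumed before stopping. The function name lines_read and the demo file are made up for illustration, and the sketch assumes the search strings contain no spaces or backslashes (they are passed through -v, which interprets escape sequences).

```shell
#!/bin/sh
# lines_read is a hypothetical helper: it prints how many input lines awk
# consumed before stopping, and returns 0 only if every string was found.
# $1: file to search; remaining arguments: literal strings (no spaces, no backslashes).
lines_read() {
    file=$1
    shift
    awk -v strings="$*" '
    BEGIN {
        n = split(strings, tmp)
        for (i = 1; i <= n; i++)
            if (!(tmp[i] in strs)) { strs[tmp[i]]; numStrings++ }
    }
    numStrings == 0 { exit }
    {
        for (str in strs)
            if (index($0, str)) { delete strs[str]; numStrings-- }
    }
    END { print NR; exit (numStrings ? 1 : 0) }
    ' "$file"
}

# Demo: both strings occur in the first two lines of a 1000-line file.
demo=$(mktemp)
{
    echo string1
    echo string2
    awk 'BEGIN { for (i = 3; i <= 1000; i++) print "filler" }'
} > "$demo"
lines_read "$demo" string1 string2    # prints 3
rm -f "$demo"
```

The count is 3 rather than 2 because the `numStrings == 0` test runs at the start of the next record, so awk reads one line past the final match before exiting.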

If you were looking for regexps instead of strings then, with GNU awk for multi-char RS and retention of $0 in the END section, you could do:

awk -v RS='^$' 'END{exit !(/regexp1/ && /regexp2/ && /regexp3/)}' file

Actually, even if it were strings you could do:

awk -v RS='^$' 'END{exit !(index($0,"string1") && index($0,"string2") && index($0,"string3"))}' file

The main issue with the above 2 GNU awk solutions is that, like @anubhava's GNU grep -P solution, the whole file has to be read into memory at once. The first multi-line awk script above, by contrast, works in any awk in any shell on any UNIX box and stores only one line of input at a time.

I see you've added a comment under your question saying you could have several thousand "patterns". Assuming you mean "strings", then instead of passing them as arguments to the script you could read them from a file, e.g. with GNU awk for multi-char RS and a file with one search string per line:

awk '
NR==FNR { strings[$0]; next }
{
    for (string in strings)
        if ( !index($0,string) )
            exit 1
}
' file_of_strings RS='^$' file_to_be_searched

and for regexps it'd be:

awk '
NR==FNR { regexps[$0]; next }
{
    for (regexp in regexps)
        if ( $0 !~ regexp )
            exit 1
}
' file_of_regexps RS='^$' file_to_be_searched

If you don't have GNU awk and your input file does not contain NUL characters, then you can get the same effect as above by using RS='\0' instead of RS='^$', or by appending each line to a variable as it's read and then processing that variable in the END section.
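The second option, accumulating the input in a variable, might be sketched as follows. The function name contains_all is made up for illustration, and the sketch selects the first file with FILENAME == ARGV[1] rather than NR==FNR so that an empty strings file is not mistaken for the data file. Like the RS-based versions, it holds the whole searched file in memory.

```shell
#!/bin/sh
# contains_all is a hypothetical helper name, not part of the original answer.
# $1: file with one literal search string per line; $2: file to search.
contains_all() {
    awk '
    FILENAME == ARGV[1] { strings[$0]; next }   # first file: collect search strings
    { buf = buf $0 "\n" }                        # second file: append each line to buf
    END {
        for (s in strings)
            if (!index(buf, s))
                exit 1                           # at least one string is missing
    }
    ' "$1" "$2"
}

# Usage:
#   if contains_all file_of_strings file_to_be_searched; then
#       echo "all strings found"
#   fi
```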

If your file_to_be_searched is too large to fit in memory then it'd be this for strings:

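awk '
NR==FNR {
    # count each distinct search string once, even if the file repeats it
    if ( !($0 in strings) ) {
        strings[$0]
        numStrings++
    }
    next
}
numStrings == 0 { exit }    # every string has been seen: stop reading
{
    for (string in strings) {
        if ( index($0,string) ) {
            delete strings[string]
            numStrings--
        }
    }
}
END { exit (numStrings ? 1 : 0) }
' file_of_strings file_to_be_searched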

and the equivalent for regexps:

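awk '
NR==FNR {
    # count each distinct regexp once, even if the file repeats it
    if ( !($0 in regexps) ) {
        regexps[$0]
        numRegexps++
    }
    next
}
numRegexps == 0 { exit }    # every regexp has matched: stop reading
{
    for (regexp in regexps) {
        if ( $0 ~ regexp ) {
            delete regexps[regexp]
            numRegexps--
        }
    }
}
END { exit (numRegexps ? 1 : 0) }
' file_of_regexps file_to_be_searched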
