How do I use awk to extract data within nested delimiters using non-greedy regexps

Question

This question occurs repeatedly in many forms with many different multi-character delimiters and so IMHO is worth a canonical answer.

Given an input file like:

<foo> .. 1 <foo> .. a<2 .. </foo> .. </foo> <foo> .. @{<>}@ <foo> .. 4 .. </foo> .. </foo> <foo> .. 5 .. </foo>

how do you extract the text between the nested start (<foo>) and end (</foo>) delimiters using non-greedy matching with awk?

Desired output (in any order) is:

<foo> .. a<2 .. </foo>
<foo> .. 1  .. </foo>
<foo> .. 4 .. </foo>
<foo> .. @{<>}@  .. </foo>
<foo> .. 5 .. </foo>

Note that start or end could be any multi-character string and the text between them could be anything except those strings, including characters which are part of those strings such as the < or > characters in this example.

Recommended answer

The main challenge is that awk only supports greedy matching, so you can't write any variation of <foo>.*</foo> that will stop at the first </foo> on the line instead of the last </foo>. The solution is to convert each start and end string into a single character that cannot appear in the input, so that you can write x[^xy]*y, where x and y are those start/end characters. But how do you choose a character that can't appear in the input? You don't - you make one:

$ cat nonGreedy.awk
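{
    # Replace each start/end string with a single character.
    $0 = encode($0)
    # Repeatedly extract the leftmost pair that contains no other pair,
    # then remove it from the record so the pair enclosing it can match.
    while ( match($0,/({[^{}]*})/) ) {
        print decode(substr($0,RSTART,RLENGTH))
        $0 = substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)
    }
}
function encode(str) {
    gsub(/@/,"@A",str)                              # escape the escape character
    gsub(/{/,"@B",str); gsub(/}/,"@C",str)          # free up { and }
    gsub(/<foo>/,"{",str); gsub(/<\/foo>/,"}",str)  # delimiters -> { and }
    return str
}
function decode(str) {
    # Undo encode() in exactly the reverse order.
    gsub(/}/,"</foo>",str); gsub(/{/,"<foo>",str)
    gsub(/@C/,"}",str); gsub(/@B/,"{",str)
    gsub(/@A/,"@",str)
    return str
}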

$ awk -f nonGreedy.awk file
<foo> .. a<2 .. </foo>
<foo> .. 1  .. </foo>
<foo> .. 4 .. </foo>
<foo> .. @{<>}@  .. </foo>
<foo> .. 5 .. </foo>

The above works by picking any character that can't appear JUST IN THE START/END STRINGS (note that it doesn't have to be a character that can't appear in the input at all, just not in those strings), in this case I'm choosing @, and appending an A after each occurrence of it in the input. At this point every occurrence of @A represents an @ character, and there are guaranteed to be no occurrences of @B, or of @ followed by anything other than A, anywhere in the input.
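
As a minimal sketch of that first step in isolation (the sample record and the shell variable name `escaped` are made up for illustration):

```shell
# Hypothetical sample record: the "@" characters are ordinary data, and the
# record happens to contain a literal "@B" already.
escaped=$(printf '%s\n' 'a@b @B c@' | awk '{ gsub(/@/, "@A"); print }')
printf '%s\n' "$escaped"
# prints: a@Ab @AB c@A
```

The literal @B in the data has become @AB, so after this step a bare @B can no longer occur and is free to be given a new meaning.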

Now we can pick 2 other characters to represent the start/end strings, in this case I'm choosing { and }, and convert any existing occurrences of them in the input to @-prefixed strings such as @B and @C. At this point every occurrence of @B represents a { character, every @C represents a } character, and there are no {s or }s anywhere in the input.
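
A small sketch of the first two steps together, on a made-up record that contains both literal braces and a literal @B. It is written with bracket expressions ([{] and [}]) rather than bare braces, which is only a stylistic assumption here to keep the braces unambiguously literal:

```shell
# Hypothetical sample record; "freed" is just an illustrative variable name.
freed=$(printf '%s\n' 'x{y}@B' |
awk '{
    gsub(/@/, "@A")     # step 1: escape the escape character
    gsub(/[{]/, "@B")   # step 2: literal { becomes @B
    gsub(/[}]/, "@C")   #         literal } becomes @C
    print
}')
printf '%s\n' "$freed"
# prints: x@By@C@AB
```

The original literal @B ends up as @AB while the original { ends up as @B, so the two stay distinguishable.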

Now all that's left to do to find the strings we want to extract is to convert every start string <foo> to the start character we've chosen, {, and every end string </foo> to the end character }, and then we can use a simple regexp of {[^{}]*} to represent a non-greedy version of <foo>.*</foo>.
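
To make the encoded form concrete, here is a sketch that prints a fully encoded record followed by the first match of the single-character regexp. The sample record is invented, and the bracket-expression spelling [{][^{}]*[}] is an assumed equivalent of {[^{}]*}:

```shell
# Hypothetical sample record with one nested pair and a literal {x}.
encoded=$(printf '%s\n' '<foo> a {x} <foo> b </foo> c </foo>' |
awk '{
    gsub(/@/, "@A")                           # escape the escape character
    gsub(/[{]/, "@B"); gsub(/[}]/, "@C")      # free up { and }
    gsub(/<foo>/, "{"); gsub(/<\/foo>/, "}")  # delimiters become { and }
    print                                     # the encoded record
    if (match($0, /[{][^{}]*[}]/))
        print substr($0, RSTART, RLENGTH)     # first pair with no pair inside
}')
printf '%s\n' "$encoded"
# prints:
# { a @Bx@C { b } c }
# { b }
```

The match skips the outer pair because [^{}]* cannot cross the inner {, which is what makes the inner pair come out first.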

As we find each string we just unwind the conversions we did above in reverse order (note that you must unwind the substitutions on each matched string in exactly the reverse order you applied them to the whole record), so { goes back to <foo>, @B goes back to {, @A goes back to @, etc., and we have the original text for that string.
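
The ordering requirement can be checked with a small round-trip sketch. The record is made up, badDecode() is a hypothetical function that applies the same substitutions in the wrong order, and the braces are written as bracket expressions by assumption:

```shell
roundtrip=$(printf '%s\n' '<foo> @B {x} </foo>' |
awk '
function encode(str) {
    gsub(/@/, "@A", str)
    gsub(/[{]/, "@B", str); gsub(/[}]/, "@C", str)
    gsub(/<foo>/, "{", str); gsub(/<\/foo>/, "}", str)
    return str
}
function decode(str) {      # exact reverse of encode()
    gsub(/[}]/, "</foo>", str); gsub(/[{]/, "<foo>", str)
    gsub(/@C/, "}", str); gsub(/@B/, "{", str)
    gsub(/@A/, "@", str)
    return str
}
function badDecode(str) {   # hypothetical: same steps, @A undone first
    gsub(/@A/, "@", str)
    gsub(/@C/, "}", str); gsub(/@B/, "{", str)
    gsub(/[}]/, "</foo>", str); gsub(/[{]/, "<foo>", str)
    return str
}
{
    enc = encode($0)
    good = decode(enc)
    bad = badDecode(enc)
    print (good == $0 ? "ok" : "corrupted") ": " good
    print (bad == $0 ? "ok" : "corrupted") ": " bad
}')
printf '%s\n' "$roundtrip"
# prints:
# ok: <foo> @B {x} </foo>
# corrupted: <foo> <foo> <foo>x</foo> </foo>
```

Undoing @A too early turns the escaped @AB back into @B, which the later steps then mistake for an encoded brace and finally for a delimiter.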

The above will work in any awk. If your start/end strings contain RE metacharacters then you'd have to escape those or use a while(index(substr())) loop instead of gsub() to replace them.
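
One possible shape for such a loop is sketched below. The function name replace_literal() and the (* ... *) delimiters are hypothetical, chosen only because both delimiters contain RE metacharacters, and the sketch assumes the search string is non-empty:

```shell
literal=$(printf '%s\n' 'a (* b.c *) d' |
awk '
# Replace every literal occurrence of old in str with repl, without
# treating old as a regexp. Assumes old is not the empty string.
function replace_literal(str, old, repl,    out, pos) {
    out = ""
    while ( (pos = index(str, old)) > 0 ) {
        out = out substr(str, 1, pos - 1) repl
        str = substr(str, pos + length(old))
    }
    return out str
}
{
    $0 = replace_literal($0, "(*", "{")
    $0 = replace_literal($0, "*)", "}")
    print
}')
printf '%s\n' "$literal"
# prints: a { b.c } d
```

Because index() compares plain strings, nothing in the delimiters needs escaping; the same function could be used in the other direction when decoding.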

Note that if you do use gawk (FPAT is a gawk extension) and the delimiters aren't nested, then you can keep the 2 functions exactly as above and change the rest of the script to just:

BEGIN { FPAT="{[^{}]*}" }
{
    $0 = encode($0)
    for (i=1; i<=NF; i++) {
        print decode($i)
    }
}

Obviously you don't really need to put the encode/decode functionality in separate functions, I just separated that out here to make that functionality explicit and separate from the loop that uses it for clarity.
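
For comparison, a sketch of the same technique with the encode/decode steps written inline, run on the sample input from the question. The variable names are illustrative, and the braces are written as bracket expressions by assumption:

```shell
extracted=$(printf '%s\n' '<foo> .. 1 <foo> .. a<2 .. </foo> .. </foo> <foo> .. @{<>}@ <foo> .. 4 .. </foo> .. </foo> <foo> .. 5 .. </foo>' |
awk '{
    # encode the whole record
    gsub(/@/, "@A"); gsub(/[{]/, "@B"); gsub(/[}]/, "@C")
    gsub(/<foo>/, "{"); gsub(/<\/foo>/, "}")
    while ( match($0, /[{][^{}]*[}]/) ) {
        s = substr($0, RSTART, RLENGTH)
        $0 = substr($0, 1, RSTART-1) substr($0, RSTART+RLENGTH)
        # decode just the matched string, in reverse order
        gsub(/[}]/, "</foo>", s); gsub(/[{]/, "<foo>", s)
        gsub(/@C/, "}", s); gsub(/@B/, "{", s); gsub(/@A/, "@", s)
        print s
    }
}')
printf '%s\n' "$extracted"
```

This should print the same five lines as the function-based script; the functions mainly make the pairing between the encode and decode steps easier to see.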

For another example of when/how to apply the above approach, see https://stackoverflow.com/a/40540160/1745001.
