Understanding awk delimiter - escaping in a regex-based field separator

Question

I have the following shell command:

awk -F'\[|\]' '{print $2}'

What is this command doing? Does it split each line into fields, using [sometext] as the delimiter?

For example:

$ echo "this [line] passed to awk" | awk -F'\[|\]' '{print $2}'
line

Editor's note: Only Mawk, as used on Ubuntu by default, produces the output above.

Recommended answer

The apparent intent is to treat literal [ and ] as field-separator characters, i.e., to split each input record into fields at each occurrence of [ or ]. With the sample line, this yields "this " as field 1 ($1), "line" as field 2 ($2), and " passed to awk" as the last field ($3).

This is achieved by a regex (regular expression) that uses alternation (|), either side of which defines a field separator (delimiter): \[ and \] in a regex are needed to represent literal [ and ], because, by default, [ and ] are so-called metacharacters (characters with special syntactical meaning).
Note that awk always interprets the value of the FS variable (-F option) as a regex.

However, the correct form is '\\[|\\]':

$ echo "this [line] passed to awk" | awk -F'\\[|\\]' '{print $2}'
line

That said, a more concise version that uses a character set ([...]) rather than alternation (|) is:

$ echo "this [line] passed to awk" | awk -F'[][]' '{print $2}'
line

Note the careful placement of ] before [ inside the enclosing [...]: a ] that comes first in a set is taken literally, which is what makes this work. Note also that the enclosing [...] now have special meaning: they enclose a set of characters, any of which matches.
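As a quick check of the character-set form, the following sketch prints every field that -F'[][]' produces for the sample line. The variable names and the <...> markers are illustrative only; the markers simply make leading and trailing spaces visible.

```shell
# Sample record from the question.
record='this [line] passed to awk'

# Split on any single [ or ] and print each field together with its index.
fields=$(printf '%s\n' "$record" |
  awk -F'[][]' '{ for (i = 1; i <= NF; i++) printf "%d=<%s>\n", i, $i }')

printf '%s\n' "$fields"
```

With the sample line this should report three fields: "this " (with a trailing space), "line", and " passed to awk" (with a leading space).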

As for why 2 \ instances are needed in '\\[|\\]':

Taken as a regex in isolation, \[|\] would work:

  • \[ matches literal [
  • \] matches literal ]
  • | is an alternation that matches one or the other.

However, Awk's string processing comes first:

  • Because of \ handling in string literals, Awk should reduce \[|\] to [|] before interpreting it as a regex.

  • Unfortunately, Mawk (the default Awk on Ubuntu, for instance) resorts to guesswork in this particular scenario.[1]

[|], interpreted as a regex, would then match only a single, literal |.

Thus, the robust and portable way is to use \\ in a string literal when you mean to pass a single \ as part of a regex.

This quote from the relevant section of the GNU Awk manual sums it up well:

To get a backslash into a regular expression inside a string, you have to type two backslashes.
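The same doubling rule applies wherever a regex is supplied as a string, not only with -F. The sketch below is a minimal illustration (the variable names are made up for this example): it passes the double-escaped string to split() and, separately, assigns it to FS via -v.

```shell
record='a[b]|c'

# Inside an Awk string literal, "\\[" becomes the regex \[ (a literal [).
via_split=$(printf '%s\n' "$record" |
  awk '{ n = split($0, parts, "\\[|\\]"); print n ": " parts[2] }')

# The same string assigned to FS through -v, which also undergoes
# string escape processing before the value is used as a regex.
via_fs=$(printf '%s\n' "$record" | awk -v FS='\\[|\\]' '{ print NF ": " $2 }')

printf '%s\n%s\n' "$via_split" "$via_fs"
```

Both commands should split the record into a, b and |c, so each reports three fields with b as the second one.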


[1] Implementation differences:

Unfortunately, at least 1 major Awk implementation resorts to guesswork in the presence of a single \ before a regex metacharacter inside a string literal.

BSD/macOS Awk and GNU Awk act predictably and GNU Awk also issues a helpful warning when a singly \-prefixed regex metacharacter is found:

# GNU Awk: Predictable string-first processing + a helpful warning.
$ echo 'a[b]|c' | gawk -F'\[|\]' '{print $2}'
gawk: warning: escape sequence '\[' treated as plain '['
gawk: warning: escape sequence '\]' treated as plain ']'
c

# BSD/macOS Awk: Predictable string-first processing, no warning.
$ echo 'a[b]|c' | awk -F'\[|\]' '{print $2}'
c

# Mawk: *Guesses* that a *regex* was intended.
#       The unambiguous form -F'\\[|\\]' works too, fortunately.
$ echo 'a[b]|c' | mawk -F'\[|\]' '{print $2}'
b


Optional reading: regex literals inside Awk scripts

Awk supports regex literals enclosed in /.../, the use of which bypasses the double-escaping problem.

However:

  • These literals (which are invariably constant) are only available inside an Awk script,
  • and, it seems, you can only use them as patterns or function arguments - you cannot store them in a variable.

Therefore, even though /\[|\]/ is in principle equivalent to "\\[|\\]", you cannot use the following, because a regex literal cannot be assigned to the (special) variable FS:

# !! DOES NOT WORK in any of the 3 major Awk implementations.
#    Note that nothing is output, and no error/warning is displayed.
$ echo 'a[b]|c' | awk 'BEGIN { FS=/\[|\]/ } { print $2 }'

# Using a double-escaped *string* to house the regex again works as expected:
$ echo 'a[b]|c' | awk 'BEGIN { FS="\\[|\\]" } { print $2 }'
b
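Where regex literals are allowed, i.e. as a pattern or as a function argument, single backslashes are enough. The following sketch illustrates both uses; the variable names and the second input line are assumptions made for this example.

```shell
record='a[b]|c'

# A regex literal as the separator argument of split(): single backslashes suffice.
second=$(printf '%s\n' "$record" |
  awk '{ split($0, parts, /\[|\]/); print parts[2] }')

# A regex literal as a pattern: count the records that contain [ or ].
matches=$(printf '%s\n' "$record" 'no brackets here' |
  awk '/\[|\]/ { n++ } END { print n + 0 }')

printf '%s\n%s\n' "$second" "$matches"
```

The first command should print b, and the second should count exactly one matching record, since only the first of the two input lines contains a bracket.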
