Understanding awk delimiter - escaping in a regex-based field separator
Question
I have the following shell command:
awk -F'\[|\]' '{print $2}'
What is this command doing? Is it splitting each line into fields, using [sometext] as the delimiter?
For example:
$ echo "this [line] passed to awk" | awk -F'\[|\]' '{print $2}'
line
Editor's note: Only Mawk, as used on Ubuntu by default, produces the output above.
Recommended answer
The apparent intent is to treat literal [ and ] as field-separator characters, i.e., to split each input record into fields at each occurrence of [ and/or ], which, with the sample line, yields this as field 1 ($1), line as field 2 ($2), and passed to awk as the last field ($3).
This is achieved by a regex (regular expression) that uses alternation (|), either side of which defines a field separator (delimiter): \[ and \] are needed in a regex to represent literal [ and ], because, by default, [ and ] are so-called metacharacters (characters with special syntactical meaning).
Note that awk interprets the value of the FS variable (-F option) as a regex whenever it is longer than a single character.
However, the correct form is '\\[|\\]':
$ echo "this [line] passed to awk" | awk -F'\\[|\\]' '{print $2}'
line
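To see every field the separator produces, not just $2, a small sketch can print them all. The helper name show_fields is hypothetical and chosen only for illustration; the separator is the double-backslash form used in the command above.

```shell
# show_fields is a hypothetical helper name, not part of Awk.
# It prints each field of every input line with its number, wrapped in
# angle brackets so that leading and trailing spaces stay visible.
show_fields() {
  awk -F'\\[|\\]' '{ for (i = 1; i <= NF; i++) printf "%d:<%s>\n", i, $i }'
}

echo "this [line] passed to awk" | show_fields
# 1:<this >
# 2:<line>
# 3:< passed to awk>
```

Note that fields 1 and 3 keep the spaces that sat next to the brackets, because only [ and ] act as separators here.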
That said, a more concise version that uses a character set ([...]) rather than alternation (|) is:
$ echo "this [line] passed to awk" | awk -F'[][]' '{print $2}'
line
Note the careful placement of ] before [ inside the enclosing [...] to make this work, and how the enclosing [...] now has special meaning: it encloses a set of characters, any of which matches.
As for why two \ instances are needed in '\\[|\\]':
Taken as a regex in isolation, \[|\] would work:
- \[ matches a literal [
- \] matches a literal ]
- | is an alternation that matches one or the other.
However, Awk's string processing comes first:
- Due to \ handling in a string, it should reduce \[|\] to [|] before interpretation as a regex.
- Unfortunately, however, Mawk, the default Awk on Ubuntu, for instance, resorts to guesswork in this particular scenario.[1]
- [|], interpreted as a regex, would then only match a single, literal |
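The last point can be checked directly by passing the already-reduced form [|] as the separator. This is a minimal sketch; split_on_bar is a hypothetical helper name used only for this illustration.

```shell
# split_on_bar is a hypothetical helper name, not part of Awk.
# The bracket expression [|] matches exactly one literal |, so the
# [ and ] characters in the input are NOT treated as separators.
split_on_bar() {
  awk -F'[|]' '{ print $2 }'
}

echo 'a[b]|c' | split_on_bar
# c
```

The input is split only at |, so $1 is a[b] and $2 is c, which matches what the string-first Awk implementations print for the single-backslash form.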
Thus, the robust and portable way is to use \\ in a string literal when you mean to pass a single \ as part of a regex.
This quote from the relevant section of the GNU Awk manual sums it up well:
To get a backslash into a regular expression inside a string, you have to type two backslashes.
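The same rule applies to any string used as a regex inside an Awk script, not only to FS. As a sketch, the hypothetical helper has_open_bracket below matches lines containing a literal [ by using a string on the right-hand side of the ~ operator.

```shell
# has_open_bracket is a hypothetical helper name, not part of Awk.
# Inside the Awk string, "\\[" becomes \[ after string processing,
# which the regex engine then reads as a literal [.
has_open_bracket() {
  awk '$0 ~ "\\[" { print "match: " $0 }'
}

printf '%s\n' 'a[b]' 'plain' | has_open_bracket
# match: a[b]
```

Only the first input line contains a [, so only that line is printed.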
[1] Implementation differences:
Unfortunately, at least one major Awk implementation resorts to guesswork in the presence of a single \ before a regex metacharacter inside a string literal.
BSD/macOS Awk and GNU Awk act predictably, and GNU Awk also issues a helpful warning when a singly \-prefixed regex metacharacter is found:
# GNU Awk: Predictable string-first processing + a helpful warning.
echo 'a[b]|c' | gawk -F'\[|\]' '{print $2}'
gawk: warning: escape sequence '\[' treated as plain '['
gawk: warning: escape sequence '\]' treated as plain ']'
c
# BSD/macOS Awk: Predictable string-first processing, no warning.
echo 'a[b]|c' | awk -F'\[|\]' '{print $2}'
c
# Mawk: *Guesses* that a *regex* was intended.
# The unambiguous form -F'\\[|\\]' works too, fortunately.
echo 'a[b]|c' | mawk -F'\[|\]' '{print $2}'
b
Optional reading: regex literals inside Awk scripts
Awk supports regex literals enclosed in /.../, the use of which bypasses the double-escaping problem.
However:
- These literals (which are invariably constant) are only available inside an Awk script,
- and, it seems, you can only use them as patterns or function arguments - you cannot store them in a variable.
Therefore, even though /\[|\]/ is in principle equivalent to "\\[|\\]", you cannot use the following, because the regex literal cannot be assigned to the (special) variable FS:
# !! DOES NOT WORK in any of the 3 major Awk implementations.
# Note that nothing is output, and no error/warning is displayed.
$ echo 'a[b]|c' | awk 'BEGIN { FS=/\[|\]/ } { print $2 }'
# Using a double-escaped *string* to house the regex again works as expected:
$ echo 'a[b]|c' | awk 'BEGIN { FS="\\[|\\]" } { print $2 }'
b
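A regex literal does work where it is passed as a function argument. The sketch below uses the standard split() function with /\[|\]/ as its separator argument, so no double escaping is needed; second_part is a hypothetical helper name. As background, in an assignment such as FS=/.../ the literal is evaluated as the match expression $0 ~ /.../, which yields 0 or 1 rather than the regex itself, and that would explain why the assignment fails silently.

```shell
# second_part is a hypothetical helper name, not part of Awk.
# split() accepts a regex literal as its third argument, so a single
# backslash before [ and ] is enough here.
second_part() {
  awk '{ n = split($0, parts, /\[|\]/); print n, parts[2] }'
}

echo 'a[b]|c' | second_part
# 3 b
```

The line is split into a, b, and |c, so split() returns 3 and the second element is b.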