Default field separator for awk
Question
Sorry for the basic question: I searched but am not confident that I found the right answer. Is the default field separator for awk only a space?
Recommended answer
Here is a pragmatic summary that applies to all major Awk implementations:
- GNU Awk (gawk) - the default awk in some Linux distros
- Mawk (mawk) - the default awk in some Linux distros (e.g., Ubuntu)
- BSD Awk - a.k.a. BWK Awk - the default awk on BSD-like platforms, including OSX
On Linux, awk -W version will tell you which implementation the default awk is.
BSD Awk only understands awk --version (which GNU Awk understands in addition to awk -W version).
Recent versions of all these implementations follow the POSIX standard with respect to field separators [1] (but not record separators).
Glossary:
- RS is the input-record separator, which describes how the input is broken into records:
  - The POSIX-mandated default value is a newline, also referred to as \n below; that is, input is broken into lines by default.
  - On awk's command line, RS can be specified as -v RS=<sep>.
  - POSIX restricts RS to a literal, single-character value, but GNU Awk and Mawk support multi-character values that may be extended regular expressions (BSD Awk does not support that).
- FS is the input-field separator, which describes how each record is split into fields; it may be an extended regular expression.
  - On awk's command line, FS can be specified as -F <sep> (or -v FS=<sep>).
  - The POSIX-mandated default value is formally a space (0x20), but that space is not literally interpreted as the (only) separator; it has special meaning, see below.
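As a rough illustration of the two command-line forms in the glossary, the sketch below sets RS and FS explicitly. The sample data (a;b;c and alice:x:1000) and the shell variable names are made up for this example.

```shell
# RS=';' breaks the input into records at semicolons instead of newlines.
# printf emits no trailing newline here, so the last record is exactly "c".
records=$(printf 'a;b;c' | awk -v RS=';' '{ print NR ": " $0 }')
printf '%s\n' "$records"
# prints:
# 1: a
# 2: b
# 3: c

# -F <sep> and -v FS=<sep> are two spellings of the same setting.
line='alice:x:1000'   # hypothetical sample record
with_F=$(printf '%s\n' "$line" | awk -F: '{ print $3 }')
with_v=$(printf '%s\n' "$line" | awk -v FS=: '{ print $3 }')
printf '%s %s\n' "$with_F" "$with_v"
# prints: 1000 1000
```

Both separators here are single literal characters, so the commands stay within what POSIX requires of every implementation.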
By default:
- any run of spaces and/or tabs and/or newlines is treated as a field separator
- with leading and trailing runs ignored.
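A minimal sketch of the default behavior, using a made-up input line with leading, trailing, and mixed whitespace (the variable name is arbitrary):

```shell
# Input: two leading spaces, "a", a space-tab-space run, "b", two trailing spaces.
# With FS left at its default, the run counts as one separator and the
# leading/trailing whitespace produces no empty fields.
default_split=$(printf '  a \t b  \n' | awk '{ print NF ": <" $1 "> <" $2 ">" }')
printf '%s\n' "$default_split"
# prints: 2: <a> <b>
```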
The POSIX spec. uses the abstraction <blank> for spaces and tabs; that holds in all locales, but <blank> could comprise additional characters in specific locales - I don't know if any such locales exist.
Note that with the default input-record separator (RS), \n, newlines typically do not enter the picture as field separators, because no record itself contains \n in that case.
Newlines as field separators do come into play, however:
- When RS is set to a value that results in records themselves containing \n instances (such as when RS is set to the empty string; see below).
- Generally, when the split() function is used to split a string into array elements without an explicit field-separator argument.
  - Even though the input records won't contain \n instances while the default RS is in effect, the split() function, when invoked without an explicit field-separator argument on a multi-line string from a different source (e.g., a variable passed via the -v option or as a pseudo-filename), always treats \n as a field separator.
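The two cases above can be sketched as follows. The input strings and the variable names (para, count, s, parts) are invented for illustration, and the commands assume awk is not running in a strict POSIX-compliance mode such as the one discussed in footnote [1].

```shell
# Case 1: RS='' produces multi-line records. With FS at its default,
# the newline inside the first record separates fields just as a space does.
para=$(printf 'a\nb\n\nc d\n' | awk -v RS='' '{ print NR ": NF=" NF }')
printf '%s\n' "$para"
# prints:
# 1: NF=2
# 2: NF=2

# Case 2: split() without a separator argument, applied to a multi-line
# string passed via -v (awk expands the \n escape in the assignment).
count=$(awk -v s='a\nb c' 'BEGIN { n = split(s, parts); print n }')
printf '%s\n' "$count"
# prints: 3
```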
Important non-default considerations:
- Assigning the empty string to RS has special meaning: it reads the input in paragraph mode, meaning that the input is broken into records at runs of empty lines (each run of non-empty lines forms one record), with leading and trailing runs of empty lines ignored.
- When you assign anything other than a literal space to FS, the interpretation of FS changes fundamentally:
  - A single character or each character from a specified character set is recognized individually as a field separator - not runs of it, as with the default.
    - For instance, setting FS to [ ] - even though it effectively amounts to a single space - causes every individual space instance in each record to be treated as a field separator.
  - To recognize runs, the regex quantifier (duplication symbol) + must be used; e.g., [\t]+ would recognize runs of tabs as a single separator.
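A short sketch of the difference, using made-up inputs in which two separator characters sit between a and b (the variable names are arbitrary):

```shell
# [ ] matches each space individually, so the two spaces delimit an
# empty middle field: "a", "", "b".
single=$(printf 'a  b\n' | awk -F'[ ]' '{ print NF }')
printf '%s\n' "$single"
# prints: 3

# Adding the + quantifier makes the whole run of spaces one separator.
runs=$(printf 'a  b\n' | awk -F'[ ]+' '{ print NF }')
printf '%s\n' "$runs"
# prints: 2

# The same applies to tabs: [\t]+ treats the two tabs as one separator.
tabs=$(printf 'a\t\tb\n' | awk -F'[\t]+' '{ print NF }')
printf '%s\n' "$tabs"
# prints: 2
```

Unlike the default, none of these settings strips leading or trailing separators, so a line that starts with a separator gets an empty first field.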
[1] Unfortunately, GNU Awk up to at least version 4.1.3 complies with an obsolete POSIX standard with respect to field separators when you use the option to enforce POSIX compliance, -P (--posix): with that option in effect and RS set to a non-empty value, newlines (\n instances) are NOT recognized as field separators. The GNU Awk manual spells out the obsolete behavior (but neglects to mention that it doesn't apply when RS is set to the empty string). The POSIX standard changed in 2008 (see comments) to also consider newlines field separators when FS has its default value - as GNU Awk has always done without -P (--posix).
Here are 2 commands that verify the behavior described above:
- With -P in effect and RS set to the empty string, \n is still treated as a field separator:
gawk -P -F' ' -v RS='' '{ printf "<%s>, <%s>\n", $1, $2 }' <<< $'a\nb'
- With -P in effect and a non-empty RS, \n is NOT treated as a field separator - this is the obsolete behavior:
gawk -P -F' ' -v RS='|' '{ printf "<%s>, <%s>\n", $1, $2 }' <<< $'a\nb'
A fix is coming, according to the GNU Awk maintainers; expect it in version 4.2 (no time frame given).
(Tip of the hat to @JohnKugelman and @EdMorton for their help.)