Default field separator for awk


Question

Sorry for this stupid question. I searched but am not confident that I found the right answer: is a space the only default field separator for awk?

Recommended answer

Here is a pragmatic summary that applies to all major Awk implementations:

  • GNU Awk (gawk) - the default awk in some Linux distros
  • Mawk (mawk) - the default awk in some Linux distros (e.g., Ubuntu)
  • BSD Awk - a.k.a. BWK Awk - the default awk on BSD-like platforms, including OSX

On Linux, awk -W version will tell you which implementation the default awk is.
BSD Awk only understands awk --version (which GNU Awk understands in addition to awk -W version).
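
The check described above can be sketched as follows, assuming a POSIX shell. The variable name awk_flavor is made up for illustration. Input is redirected from /dev/null so that an implementation that does not recognize an option cannot sit waiting for input.

```shell
# Try the -W version form first (per the text, this works for the default awk on Linux).
awk_flavor=$(awk -W version </dev/null 2>/dev/null | head -n 1)

# Fall back to --version, the form BSD Awk understands.
if [ -z "$awk_flavor" ]; then
  awk_flavor=$(awk --version </dev/null 2>/dev/null | head -n 1)
fi

# Neither form produced output: say so instead of guessing.
[ -n "$awk_flavor" ] || awk_flavor="unknown"

echo "default awk: $awk_flavor"
```

The exact text printed differs between implementations and versions, so the sketch only reports the first line of whatever the installed awk prints.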

Recent versions of all these implementations follow the POSIX standard with respect to field separators[1] (but not record separators).

Glossary:


RS is the input-record separator, which describes how the input is broken into records:

  • The POSIX-mandated default value is a newline, also referred to as \n below; that is, input is broken into lines by default.
  • On awk's command line, RS can be specified as -v RS=<sep>.
  • POSIX restricts RS to a literal, single-character value, but GNU Awk and Mawk support multi-character values that may be extended regular expressions (BSD Awk does not support that).
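
As a small illustration of the RS points above, the following sketch sets RS to a literal single character, which is the portable, POSIX-compliant case. The sample input and the variable name records are made up for illustration.

```shell
# Records are separated by ';' instead of newlines. The input has no trailing
# newline, so the last record is exactly "e f".
records=$(printf 'a b;c d;e f' | awk -v RS=';' '{ print NR ": " $0 }')

echo "$records"
# 1: a b
# 2: c d
# 3: e f
```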

FS is the input-field separator, which describes how each record is split into fields; it may be an extended regular expression.


  • On awk's command line, FS can be specified as -F <sep> (or -v FS=<sep>).
  • The POSIX-mandated default value is formally a space (0x20), but that space is not literally interpreted as the (only) separator, but has special meaning; see below.
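
The two ways of setting FS on the command line can be compared with a quick sketch. The sample line and the variable names with_F and with_v are made up for illustration.

```shell
line='alice:x:1000'

# -F<sep> and -v FS=<sep> are two spellings of the same setting.
with_F=$(printf '%s\n' "$line" | awk -F: '{ print $1, $3 }')
with_v=$(printf '%s\n' "$line" | awk -v FS=: '{ print $1, $3 }')

echo "$with_F"   # alice 1000
echo "$with_v"   # alice 1000
```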

By default:

  • any run of spaces and/or tabs and/or newlines is treated as a field separator
  • with leading and trailing runs ignored.
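
A minimal sketch of the default behavior, using a made-up input line that mixes spaces and tabs and has leading and trailing whitespace:

```shell
# Runs of spaces and tabs count as one separator each; leading and trailing
# runs do not produce empty fields.
result=$(printf '  alpha \t beta\t\tgamma  \n' | awk '{ print NF ": " $1 "|" $2 "|" $3 }')

echo "$result"   # 3: alpha|beta|gamma
```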

The POSIX spec. uses the abstraction <blank> for spaces and tabs, which is true for all locales, but could comprise additional characters in specific locales - I don't know if any such locales exist.

Note that with the default input-record separator (RS), \n, newlines typically do not enter the picture as field separators, because no record itself contains \n in that case.

Newlines as field separators do come into play, however:


  • When RS is set to a value that results in records themselves containing \n instances (such as when RS is set to the empty string; see below).
  • Generally, when the split() function is used to split a string into array elements without an explicit field-separator argument.
    • Even though the input records won't contain \n instances when the default RS is in effect, split(), when invoked without an explicit field-separator argument on a multi-line string from a different source (e.g., a variable passed via the -v option or as a pseudo-filename), always treats \n as a field separator.
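
The split() case can be sketched as below. The variable s and the array name parts are made up for illustration. The example relies on awk processing the \n escape sequence in the -v assignment, so that s holds a two-line string.

```shell
# s contains "a b", a newline, then "c". With no third argument, split() uses
# the default FS, so both the space and the newline separate elements.
pieces=$(awk -v s='a b\nc' 'BEGIN { n = split(s, parts); print n ": " parts[1] "," parts[2] "," parts[3] }')

echo "$pieces"   # 3: a,b,c
```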

Important non-default considerations

  • Assigning the empty string to RS has special meaning: it reads the input in paragraph mode, meaning that each run of non-empty lines forms one record and records are separated by runs of empty lines, with leading and trailing runs of empty lines ignored.
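
Paragraph mode can be sketched with a made-up input of two paragraphs separated by several empty lines. The variable name summary is hypothetical. Because each record here spans more than one line, the newlines inside it act as field separators under the default FS.

```shell
# First paragraph has two lines, second has one; three empty lines in between.
summary=$(printf 'p1 a\np1 b\n\n\n\np2 c\n' |
  awk -v RS='' '{ print NR ": NF=" NF ", first=" $1 ", last=" $NF }')

echo "$summary"
# 1: NF=4, first=p1, last=b
# 2: NF=2, first=p2, last=c
```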

When you assign anything other than a literal space to FS, the interpretation of FS changes fundamentally:


  • A single character or each character from a specified character set is recognized individually as a field separator - not runs of it, as with the default.
    • For instance, setting FS to [ ] - even though it effectively amounts to a single space - causes every individual space instance in each record to be treated as a field separator.
    • To recognize runs, the regex quantifier (duplication symbol) + must be used; e.g., [\t]+ would recognize runs of tabs as a single separator.
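
The difference between the default FS, a single-character set, and a quantified set can be sketched with a made-up line containing two adjacent spaces. The variable names are hypothetical.

```shell
line='a  b'   # two spaces between a and b

# Default FS: the run of two spaces is one separator.
default_nf=$(printf '%s\n' "$line" | awk '{ print NF }')

# FS='[ ]': every single space separates, yielding an empty middle field.
single_nf=$(printf '%s\n' "$line" | awk -F'[ ]' '{ print NF }')

# FS='[ ]+': the quantifier makes a run count as one separator again.
run_nf=$(printf '%s\n' "$line" | awk -F'[ ]+' '{ print NF }')

echo "default: $default_nf, [ ]: $single_nf, [ ]+: $run_nf"
# default: 2, [ ]: 3, [ ]+: 2
```

Note that [ ]+ still differs from the default in one respect stated above: only the default ignores leading and trailing runs.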

[1] Unfortunately, GNU Awk up to at least version 4.1.3 complies with an obsolete POSIX standard with respect to field separators when you use the option to enforce POSIX compliance, -P (--posix): with that option in effect and RS set to a non-empty value, newlines (\n instances) are NOT recognized as field separators. The GNU Awk manual spells out the obsolete behavior (but neglects to mention that it doesn't apply when RS is set to the empty string). The POSIX standard changed in 2008 (see comments) to also consider newlines field separators when FS has its default value - as GNU Awk has always done without -P (--posix).
Here are 2 commands that verify the behavior described above:
* With -P in effect and RS set to the empty string, \n is still treated as a field separator:
gawk -P -F' ' -v RS='' '{ printf "<%s>, <%s>\n", $1, $2 }' <<< $'a\nb'
* With -P in effect and a non-empty RS, \n is NOT treated as a field separator - this is the obsolete behavior:
gawk -P -F' ' -v RS='|' '{ printf "<%s>, <%s>\n", $1, $2 }' <<< $'a\nb'
A fix is coming, according to the GNU Awk maintainers; expect it in version 4.2 (no time frame given).
(Tip of the hat to @JohnKugelman and @EdMorton for their help.)
