gawk FS to split record into individual characters


Question

If the field separator is the empty string, each character becomes a separate field:

$ echo hello | awk -F '' -v OFS=, '{$1 = NF OFS $1} 1'
5,h,e,l,l,o

However, if FS is a regex that can match zero times, the same behaviour does not occur:

$ echo hello | awk -F ' *' -v OFS=, '{$1 = NF OFS $1} 1'
1,hello

Does anyone know why that is? I could not find anything in the gawk manual. Is FS="" just a special case?

I'm most interested in understanding why the second case does not split the record into more fields. It's as if awk is treating FS=" *" like FS=" +".

Recommended answer

Interesting question!

I just pulled the source code of gawk 4.1.0, and I think the answer can be found in the file field.c.

line 371:
 * re_parse_field --- parse fields using a regexp.
 *
 * This is called both from get_field() and from do_split()
 * via (*parse_field)().  This variation is for when FS is a regular
 * expression -- either user-defined or because RS=="" and FS==" "
 */
static long
re_parse_field(lo...

And also this line (line 425):

if (REEND(rp, scan) == RESTART(rp, scan)) {   /* null match */

This is the case that <space>* hits in your question: a null match. The implementation does not increment nf for a null match, so the whole line is treated as one single field. Note that this function is also used by do_split().

First, if FS is the null string, gawk separates each character into its own field. The gawk documentation states this clearly, and we can also see it in the code:

line 613:
 * null_parse_field --- each character is a separate field
 *
 * This is called both from get_field() and from do_split()
 * via (*parse_field)().  This variation is for when FS is the null string.
 */
static long
null_parse_field(long up_to,

If FS is a single character other than a space, awk does not treat it as a regex. This is mentioned in the documentation too, and also in the code:

line 667:
 * sc_parse_field --- single character field separator
 *
 * This is called both from get_field() and from do_split()
 * via (*parse_field)().  This variation is for when FS is a single character
 * other than space.
 */
static long
sc_parse_field(l

If we read the function, no regex matching is done there.

In the comments of re_parse_field() and sc_parse_field(), we see that do_split() invokes them too. That explains why the following command prints 1 instead of 3:

kent$  echo "foo"|awk '{split($0,a,/ */);print length(a)}'
1

Note: to keep this post from getting too long, I did not paste the complete code here. The full source can be found at:

http://git.savannah.gnu.org/cgit/gawk.git/
