gawk FS to split record into individual characters
Problem description
If the field separator is the empty string, each character becomes a separate field:
$ echo hello | awk -F '' -v OFS=, '{$1 = NF OFS $1} 1'
5,h,e,l,l,o
However, if FS is a regex that can match zero times (that is, it can match the empty string), the same behaviour does not occur:
$ echo hello | awk -F ' *' -v OFS=, '{$1 = NF OFS $1} 1'
1,hello
Does anyone know why that is? I could not find anything in the gawk manual. Is FS="" just a special case?
I'm most interested in understanding why the second case does not split the record into more fields. It's as if awk is treating FS=" *" like FS=" +".
Recommended answer
Interesting question!
I just pulled the source code of gawk 4.1.0, and I think the answer can be found in the file field.c.
line 371:
* re_parse_field --- parse fields using a regexp.
*
* This is called both from get_field() and from do_split()
* via (*parse_field)(). This variation is for when FS is a regular
* expression -- either user-defined or because RS=="" and FS==" "
*/
static long
re_parse_field(lo...
And also this line (line 425):
if (REEND(rp, scan) == RESTART(rp, scan)) { /* null match */
This is the case that <space>* hits in your question: the regex matches, but the match is empty. On such a null match the implementation does not increment nf, so it ends up treating the whole line as one single field. Note that this function is also used by do_split().
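To make that loop easier to picture, here is a rough Python model of it. This is my own simplification, not gawk's C code: the name re_split_model is hypothetical, Python's re module stands in for gawk's regex engine (the two differ in details such as leftmost-longest matching), and the only rule it is meant to illustrate is "a null match advances the scan position by one character without ending a field".

```python
import re


def re_split_model(record, fs_regex):
    """Split record roughly the way the null-match branch above implies.

    Simplified model for illustration only; not gawk's implementation.
    """
    pattern = re.compile(fs_regex)
    fields = []
    end = len(record)
    scan = 0          # where the next regex search starts
    field_start = 0   # where the current field began

    while scan < end:
        m = pattern.search(record, scan)
        if m is None:
            break
        if m.start() == m.end():
            # Null match: step over one character, do NOT end the field.
            scan += 1
            if scan == end:
                # Ran off the end: everything since field_start is one field.
                fields.append(record[field_start:end])
                return fields
            continue
        # Real (non-empty) separator: close the current field.
        fields.append(record[field_start:m.start()])
        scan = field_start = m.end()
        if scan == end:
            # Separator at the very end yields a trailing empty field.
            fields.append("")

    if scan < end:
        # No more separators: the rest of the record is the last field.
        fields.append(record[field_start:])
    return fields


print(re_split_model("hello", " *"))
print(re_split_model("a  b c", " *"))
print(re_split_model("a  b c", " +"))
```

Under this model "hello" stays one field with " *", because every match is empty, while "a  b c" splits into the same three fields for both " *" and " +". That matches the impression in the question that " *" is being treated like " +".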
First, if FS is the null string, gawk separates each character into its own field. The gawk documentation states this clearly, and we can also see it in the code:
line 613:
* null_parse_field --- each character is a separate field
*
* This is called both from get_field() and from do_split()
* via (*parse_field)(). This variation is for when FS is the null string.
*/
static long
null_parse_field(long up_to,
If FS is a single character other than a space, awk does not treat it as a regex. This is mentioned in the documentation too, and it also shows in the code:
line 667:
* sc_parse_field --- single character field separator
*
* This is called both from get_field() and from do_split()
* via (*parse_field)(). This variation is for when FS is a single character
* other than space.
*/
static long
sc_parse_field(l
If we read that function, no regex matching is done there at all.
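Putting the three cases together, the choice of parser can be pictured with a small Python sketch. This is only my reading of the comments quoted above: classify_fs and its labels are hypothetical names, and the real code in field.c handles further cases that I am ignoring here.

```python
def classify_fs(fs):
    """Return a label for how this simplified model would treat FS.

    Illustration only; the labels are made up, the routine names in the
    comments are the ones quoted from field.c above.
    """
    if fs == "":
        return "per-character"       # null_parse_field
    if fs == " ":
        return "default-whitespace"  # the default FS, not covered above
    if len(fs) == 1:
        return "literal-character"   # sc_parse_field, no regex involved
    return "regex"                   # re_parse_field


for fs in ["", "|", " *", " +"]:
    print(repr(fs), "->", classify_fs(fs))
```

So FS="" and FS=" *" never reach the same code path: the first goes to the per-character parser, the second to the regex parser with its null-match handling.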
In the comments of re_parse_field() and sc_parse_field(), we see that do_split() invokes them too. That explains why the following command prints 1 instead of 3:
kent$ echo "foo"|awk '{split($0,a,/ */);print length(a)}'
1
Note: to keep the post from getting too long, I didn't paste the complete code here. The full source can be found at:
http://git.savannah.gnu.org/cgit/gawk.git/