awk Program File Execution
As my last question was getting too long, here is a condensed version with the current code level.
Summary: I need to take in a pipe-delimited input file, check to ensure all applicable record types are present, add any that are missing, and verify/correct the number of subfields within each record type.
Input records:
AA|1234|ABCD|EDGFT|TR56BE|~BB||E5TGE|~CC|253641|84597|~DD|78HND|ACBE|||43|~EE|HISBL|78943|~FF|12345|SKIP|~GG|||TYBGFR
AA|2345|CDEF|GFHIT|48UJKK|~CC||3FKTI
Entries in known_flds, the record type and subfield count validation file:
AA~5~req
BB~2~opt
CC~3~opt
DD~6~opt
EE~4~opt
FF~2~skp
GG~4~opt
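To make the layout concrete, here is a minimal sketch of how awk sees one input record: `~` separates the segments, and the number of `|` characters in a segment is what the second column of known_flds is checked against. The helper name `count_subfields` is hypothetical and not part of the original scripts.

```shell
# Hypothetical helper: print each segment's name and its pipe count for one record.
count_subfields() {
  printf '%s\n' "$1" | awk -F'~' '{
    for (i = 1; i <= NF; i++) {
      n = split($i, parts, "|")   # n pieces means n-1 pipes
      print parts[1], n - 1       # segment name, pipes found
    }
  }'
}

count_subfields 'AA|2345|CDEF|GFHIT|48UJKK|~CC||3FKTI'
```

For the second sample record this reports 5 pipes for AA (matching known_flds) and 2 for CC (one short of the required 3).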
Current script, without the subfield correction:
#!/usr/bin/awk -f
BEGIN { FS=OFS="~" }
FNR==NR {
    dflts[$1] = create_empty_field($1,$2)
    if( $3 ~ /req|opt/ ) fld_order[++fld_cnt] = $1
    fld_rule[$1] = $3
    next
}
{
    flds = ""
    j = 1
    for(i=1; i<=fld_cnt; i++) {
        j = skip_flds( j )
        if($j !~ ("^" fld_order[i])) fld = dflts[fld_order[i]]
        else { fld = $j; j++ }
        flds = flds (flds=="" ? "" : OFS) fld
    }
    print flds
}
function create_empty_field(name, cnt, fld, i) {
    fld = name
    for(i=1; i<=cnt; i++) { fld = fld "|" }
    return( fld )
}
function skip_flds(fnum, name) {
    name = $fnum
    sub(/\|.*$/, "", name)
    while(fld_rule[name] == "skp") {
        fnum++
        name = $fnum
        sub(/\|.*$/, "", name)
    }
    return( fnum )
}
My initial attempt at performing the validation and correction of the subfields:
#!/usr/bin/awk -f
BEGIN { FS=OFS="~" }
FNR==NR {
    dflts[$1] = create_empty_field($1,$2)
    if( $3 ~ /req|opt/ ) fld_order[++fld_cnt] = $1
    fld_rule[$1] = $3
    next
}
{
    flds = ""
    j = 1
    for(i=1; i<=fld_cnt; i++) {
        j = skip_flds( j )
        if($j !~ ("^" fld_order[i])) fld = dflts[fld_order[i]]
        else { fld = fix_sub($j,$2); j++ }
        flds = flds (flds=="" ? "" : OFS) fld
    }
    print flds
}
function create_empty_field(name, cnt, fld, i) {
    fld = name
    for(i=1; i<=cnt; i++) { fld = fld "|" }
    return( fld )
}
function skip_flds(fnum, name) {
    name = $fnum
    sub(/\|.*$/, "", name)
    while(fld_rule[name] == "skp") {
        fnum++
        name = $fnum
        sub(/\|.*$/, "", name)
    }
    return( fnum )
}
function fix_sub(rec, num, upd, cnt) {
    cnt=split(rec,a,"|")-1
    upd=""
    if(cnt != num)
        {for(i=1;i<=$num;i++)
            upd = upd a[i] "|" }
    else { upd=$rec }
    return(upd)
}
The above resulted in errors when it reached the second record type. So now I know that I need to capture the 2nd value from the known_flds file in order to pass that through to the fix_sub function.
I will be adding:
sub_fld[$1] = $2
in the FNR==NR section, but beyond that, my brain is simply fried and I cannot move forward.
I know that, as a standalone, the fix_sub area works. Now I just need to get the value read from known_flds to pass through.
The desired output is:
AA|1234|ABCD|EDGFT|TR56BE|~BB||~CC|253641|84597|~DD|78HND|ACBE|||43|~EE|HISBL|78943||~GG|||TYBGFR
AA|2345|CDEF|GFHIT|48UJKK|~BB||~CC||3FKTI|~DD||||||~EE||||~GG|||
Original question: UNIX Shell Script Solution for formatting a pipe-delimited, segmented file
Try this modified script:
#!/usr/bin/awk -f
BEGIN { FS=OFS="~" }
# First file (known_flds): record defaults, field order, subfield counts and rules
FNR==NR {
    dflts[$1] = create_empty_field($1,$2)
    if( $3 ~ /req|opt/ ) {
        fld_order[++fld_cnt] = $1
        subfld_cnt[$1] = $2
    }
    fld_rule[$1] = $3
    next
}
# Second file (input records): rebuild each record in known_flds order
{
    flds = ""
    j = 1
    for(i=1; i<=fld_cnt; i++) {
        j = skip_flds( j )
        if($j !~ ("^" fld_order[i])) fld = dflts[fld_order[i]]
        else { fld = fix_sub(j); j++ }
        flds = flds (flds=="" ? "" : OFS) fld
    }
    print flds
}
# Return the first subfield (the record type name) of field number fnum
function get_field_name(fnum, name) {
    name = $fnum
    sub(/\|.*$/, "", name)
    return( name )
}
# Build an empty field: the name followed by cnt pipes
function create_empty_field(name, cnt, fld, i) {
    fld = name
    for(i=1; i<=cnt; i++) { fld = fld "|" }
    return( fld )
}
# Advance fnum past any fields whose rule is "skp"
function skip_flds(fnum, name) {
    name = get_field_name(fnum)
    while(fld_rule[name] == "skp") {
        fnum++
        name = $fnum
        sub(/\|.*$/, "", name)
    }
    return( fnum )
}
# Pad or truncate field fnum to the pipe count given in known_flds
function fix_sub(fnum, name, cnt, a, scnt, i, upd) {
    name = get_field_name(fnum)
    cnt = split($fnum, a, "|")-1
    scnt = subfld_cnt[ name ]
    if(cnt != scnt) {
        for(i=1;i<=scnt;i++)
            upd = upd a[i] "|"
        return( upd )
    }
    return( $fnum )
}
The key differences:
- subfld_cnt[$1] = $2 has been added to the req|opt section in the FNR==NR block (handling the known_flds file).
- Added the get_field_name() function, which returns the first subfield of the field specified by its fnum argument.
- Called get_field_name() from function skip_flds().
- Modified fix_sub() to take only the fnum (all the other variables are local to the function) and fix the number of subfield pipes if necessary. The call now takes only a j argument, as in fix_sub(j).
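The FNR==NR block relies on awk being given two files, with known_flds first. The sketch below shows that idiom in isolation on a reduced rule set. The function name `load_rules_demo`, the temporary file names, and the shortened sample data are assumptions made for illustration, not part of the original script.

```shell
# Hypothetical file names; mktemp keeps the demo self-contained.
tmp=$(mktemp -d)
printf '%s\n' 'AA~5~req' 'BB~2~opt' 'FF~2~skp' > "$tmp/known_flds"
printf '%s\n' 'AA|1|~FF|9|~BB|' > "$tmp/input.txt"

load_rules_demo() {
  awk -F'~' '
    FNR == NR {                      # true only while reading the first file
      if ($3 ~ /req|opt/) subfld_cnt[$1] = $2
      next
    }
    {                                # second file: report the expected counts
      for (i = 1; i <= NF; i++) {
        name = $i
        sub(/\|.*$/, "", name)       # keep only the record type name
        if (name in subfld_cnt) print name, subfld_cnt[name]
        else print name, "skipped"
      }
    }
  ' "$1" "$2"
}

load_rules_demo "$tmp/known_flds" "$tmp/input.txt"
```

The full script would be run the same way, for example `awk -f fixrec.awk known_flds input.txt` (both file names are placeholders), or directly by name once it is marked executable, since it starts with a `#!/usr/bin/awk -f` line.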
Breakdown of fix_sub() changes:
- name = get_field_name(fnum) to get the field name for lookup.
- split the $fnum, and get the count of the split (leaving in your -1 adjustment).
- scnt = subfld_cnt[ name ] to get the desired field count from the array that was added to the processing of the known_flds file. This is the primary piece you were missing.
- When cnt != scnt, fix the subfields.
- Left in your upd setting code, but removed the upd = "" since that's already done for local variables.
- Personal preference: return directly with either value instead of using the else.
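The padding and truncating behaviour can be tried on a single segment outside the full script. This is a minimal sketch of the same logic, assuming the segment text and the wanted pipe count are passed in directly rather than looked up in subfld_cnt. The wrapper name `fix_segment` and the inner function name `fix` are hypothetical.

```shell
# Hypothetical wrapper: normalise one segment to a given pipe count.
fix_segment() {
  awk -v seg="$1" -v want="$2" '
    function fix(s, scnt,    a, cnt, i, upd) {
      cnt = split(s, a, "|") - 1       # pipes currently present
      if (cnt == scnt) return s        # already correct: leave untouched
      for (i = 1; i <= scnt; i++)      # a[i] is "" past the end, so this pads
        upd = upd a[i] "|"
      return upd
    }
    BEGIN { print fix(seg, want + 0) }
  '
}

fix_segment 'BB||E5TGE|' 2        # too many subfields: truncated to BB||
fix_segment 'EE|HISBL|78943|' 4   # too few: padded to EE|HISBL|78943||
```

Because the loop always appends a pipe after each kept piece, a rebuilt segment ends in `|` even when the original did not, which is where the trailing pipe on the GG field comes from.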
I get the following:
AA|1234|ABCD|EDGFT|TR56BE|~BB||~CC|253641|84597|~DD|78HND|ACBE|||43|~EE|HISBL|78943||~GG|||TYBGFR|
AA|2345|CDEF|GFHIT|48UJKK|~BB||~CC||3FKTI|~DD||||||~EE||||~GG||||
which doesn't exactly match your desired output. The difference is in the final | in the GG field. I think your desired output is missing it. Otherwise, the final pipe of the final field just needs to be dropped after all other processing.
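If the trailing pipe really is unwanted, one possible way to drop it is a single sub() anchored at the end of the assembled record. The sketch below applies it to a finished line. In the script itself the equivalent would presumably be `sub(/\|$/, "", flds)` just before `print flds`. The function name `strip_final_pipe` is hypothetical.

```shell
# Hypothetical helper: remove one trailing pipe from an assembled record.
strip_final_pipe() {
  printf '%s\n' "$1" | awk '{ sub(/\|$/, ""); print }'
}

strip_final_pipe 'DD||||||~EE||||~GG||||'
```

Only the last pipe on the line is removed, so every other segment keeps its full subfield count.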