Fixing broken csv files using awk
Problem description
I have some csv files which are broken since there is junk such as control characters, enters and delimiters in some of the fields. An example mockup data without control characters:
id;col 1;col 2;col 3
1;data 11;good 21;data 31
2;data 12;cut
in two;data 32
3;data 13;good 23;data 33
4;data 14;has;extra delimiter;data 34
5;data 15;good 25;data 35
6;data 16;cut
and;extra delimiter;data 36
7;data 17;data 27;data 37
8;data 18;cut
in
three;data 38
9;data 19;data 29;data 39
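Before trying to fix anything, it helps to see exactly which physical lines are off. A quick per-line field count flags both kinds of damage (a sketch; the mockup above is assumed to be saved as `file`):

```shell
# Write the mockup data to "file", then flag every physical line whose
# field count differs from the header's 4 fields.
cat > file <<'EOF'
id;col 1;col 2;col 3
1;data 11;good 21;data 31
2;data 12;cut
in two;data 32
3;data 13;good 23;data 33
4;data 14;has;extra delimiter;data 34
5;data 15;good 25;data 35
6;data 16;cut
and;extra delimiter;data 36
7;data 17;data 27;data 37
8;data 18;cut
in
three;data 38
9;data 19;data 29;data 39
EOF
awk -F';' 'NF!=4 { print "line "NR": NF="NF" -> "$0 }' file
```

Short lines (NF less than 4) are records broken in two by an embedded newline; long ones (NF greater than 4, like record 4) carry an extra delimiter inside a field and need different treatment.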
I am processing the above crap with awk:
BEGIN { FS=OFS=";" }      # delimiters
NR==1 { nf=NF }           # header record is fine, use its NF
NR>1 {
    if (NF<nf) {          # if NF is less than the header's NF
        prev=$0           # store $0
        if (getline==1) { # read the "next" line
            succ=$0       # set the "next" line to succ
            $0=prev succ  # rebuild the current record
        }
    }
    if (NF!=nf)           # if NF is still not adequate,
        $0=succ           # expect the original line to be malformed
    if (NF!=nf)           # if the "next" line was malformed as well,
        next              # skip it and move on
} 1
Naturally the above program will fail on records 4 and 6 (as the actual data has several fields where the extra delimiter may lurk) and on 8 (since I only read the next line if NF is too short). I can live with losing 4 and 6, but 8 might be doable? Also, three successive ifs scream for a for loop, but it's Friday afternoon here and my day is nearing $ and I just can't spin my head around it anymore. Do you guys have any brain reserve left I could borrow? Any best practices I didn't think of?

Solution

The key here is to keep a buffer containing the lines that are still not "complete"; once they are, print them and clear the buffer:
awk -F';' '
NF>=4 && !nf { print; next }  # normal lines are printed
{                             # otherwise,
    if (nf>0) {               # continue a "broken" line by...
        buff=buff OFS $0      # appending to the buffer
        nf+=NF-1              # and adding to nf
    } else {                  # new "broken" line, so...
        buff=$0               # start the buffer
        nf=NF                 # set the number of fields seen so far
    }
}
nf>=4 {                       # once the record is complete,
    print buff                # print it
    buff=""; nf=0             # and reset the variables
}' file
Here, buff is such a buffer and nf an internal counter to keep track of how many fields have been seen so far for the current record (like you did in your attempt). We add NF-1 when appending to the buffer (that is, from the 2nd line of a broken record) because a line with NF==1 does not add any field but just continues the last field of the previous line:

8;data 18;cut   # NF==3                          |
in              # NF==1 but it just continues $3 | all together, NF==4
three;data 38   # NF==2 but $1 continues $3      |
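This accounting can be checked in isolation by piping just record 8's three physical lines through the counting rule (a toy trace, not part of the answer itself):

```shell
# Feed record 8's three broken pieces through the nf accounting rule:
# the first line contributes NF, every continuation line contributes NF-1.
printf '%s\n' '8;data 18;cut' 'in' 'three;data 38' |
awk -F';' '{ nf += (nf ? NF-1 : NF); print $0 " -> running nf=" nf }'
```

The running count stays at 3 after the lone "in" line and only reaches 4 when "three;data 38" supplies the missing field.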
With your sample input:

$ awk -F';' 'NF>=4 && !nf {print; next} {buff=(nf>0 ? buff OFS : "") $0; nf+=(nf>0 ? NF-1 : NF)} nf>=4{print buff; buff=""; nf=0}' a
id;col 1;col 2;col 3
1;data 11;good 21;data 31
2;data 12;cut in two;data 32
3;data 13;good 23;data 33
4;data 14;has;extra delimiter;data 34
5;data 15;good 25;data 35
6;data 16;cut and;extra delimiter;data 36
7;data 17;data 27;data 37
8;data 18;cut in three;data 38
9;data 19;data 29;data 39
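For a self-contained reproduction, the same one-liner can be run against the mockup written out by a here-document (the filename `broken.csv` is an assumption; any name works). Note that since OFS is left at its default, a single space, the rejoined pieces come out as "cut in two" rather than "cut;in two":

```shell
# Recreate the broken mockup and pipe it through the buffering one-liner.
cat > broken.csv <<'EOF'
id;col 1;col 2;col 3
1;data 11;good 21;data 31
2;data 12;cut
in two;data 32
3;data 13;good 23;data 33
4;data 14;has;extra delimiter;data 34
5;data 15;good 25;data 35
6;data 16;cut
and;extra delimiter;data 36
7;data 17;data 27;data 37
8;data 18;cut
in
three;data 38
9;data 19;data 29;data 39
EOF
awk -F';' 'NF>=4 && !nf {print; next}
           {buff=(nf>0 ? buff OFS : "") $0; nf+=(nf>0 ? NF-1 : NF)}
           nf>=4 {print buff; buff=""; nf=0}' broken.csv
```

This prints ten physical lines: one record per line, with the broken records 2, 6 and 8 rejoined.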