Fixing broken csv files using awk


Question

I have some csv files which are broken since there is junk such as control characters, newlines, and extra delimiters in some of the fields. An example of mockup data without the control characters:

id;col 1;col 2;col 3
1;data 11;good 21;data 31
2;data 12;cut
in two;data 32
3;data 13;good 23;data 33
4;data 14;has;extra delimiter;data 34
5;data 15;good 25;data 35
6;data 16;cut
and;extra delimiter;data 36
7;data 17;data 27;data 37
8;data 18;cut
in 
three;data 38
9;data 19;data 29;data 39

I am processing the above crap with awk:

BEGIN { FS=OFS=";" }       # delimiters
NR==1 { nf=NF; }           # header record is fine, use the NF
NR>1 {
    if(NF<nf) {            # if NF less than header's NF
        prev=$0            # store $0
        if(getline==1) {   # read the "next" line
            succ=$0        # set the "next" line to succ
            $0=prev succ   # rebuild a current record
        }
    }
    if(NF!=nf)             # if NF is still not adequate
        $0=succ            # expect original line to be malformed
    if(NF!=nf)             # if the "next" line was malformed as well
        next               # well skip "next" line and move to next
} 1

Naturally, the above program will fail on records 4 and 6 (as the actual data has several fields where the extra delimiter may lurk) and on 8 (since I only read the next line if NF is too short). I can live with losing 4 and 6, but 8 might be doable?

Also, three successive ifs scream for a for loop but it's Friday afternoon here and my day is nearing $ and I just can't spin my head around it anymore. Do you guys have any brain reserve left I could borrow? Any best practices I didn't think of?

Solution

The key here is to keep a buffer containing the lines that are still not "complete"; once they are, print them and clear the buffer:

awk -F';' 'NF>=4 && !nf {print; next}   # normal lines are printed
           {                            # otherwise,
                if (nf>0) {             # continue with a "broken" line by...
                    buff=buff OFS $0      # appending to the buffer
                    nf+=NF-1              # and adding NF-1 (see below)
                } else {                # new "broken" line, so...
                    buff=$0               # start buffer
                    nf=NF                 # set number of fields already seen
                }
            }
           nf>=4{                       # once line is complete
              print buff                # print it
              buff=""; nf=0             # and remove variables
           }' file

Here, buff is that buffer and nf is an internal counter that keeps track of how many fields have been seen so far for the current record (much like in your own attempt).

We are adding NF-1 when appending to the buffer (that is, from the 2nd line of a broken record) because a line with NF==1 does not add any new field; it just concatenates with the last field of the previous line:

8;data 18;cut        # NF==3                           |
in                   # NF==1 but it just continues $3  | all together, NF==4
three;data 38        # NF==2 but $1 continues $3       |
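
In terms of the counter, record 8 accumulates nf = 3 + (1-1) + (2-1) = 4, which reaches the threshold and flushes the buffer.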

With your sample input:

$ awk -F';' 'NF>=4 && !nf {print; next} {buff=(nf>0 ? buff OFS : "") $0; nf+=(nf>0 ? NF-1 : NF)} nf>=4{print buff; buff=""; nf=0}' a
id;col 1;col 2;col 3
1;data 11;good 21;data 31
2;data 12;cut in two;data 32
3;data 13;good 23;data 33
4;data 14;has;extra delimiter;data 34
5;data 15;good 25;data 35
6;data 16;cut and;extra delimiter;data 36
7;data 17;data 27;data 37
8;data 18;cut in  three;data 38
9;data 19;data 29;data 39
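
If you'd rather not hardcode the expected field count of 4 and instead take it from the header, as your original attempt does, the same buffering logic can be driven by a variable. A minimal sketch along those lines (untested beyond the sample above, and assuming the header line itself always arrives intact):

awk -F';' '
NR==1           { want=NF; print; next }        # learn the expected field count from the header
NF>=want && !nf { print; next }                 # complete records pass straight through
{
    if (nf > 0) { buff=buff OFS $0; nf+=NF-1 }  # continuation: join with a space, count NF-1
    else        { buff=$0; nf=NF }              # first fragment starts the buffer
}
nf>=want        { print buff; buff=""; nf=0 }   # enough fields seen: flush the buffer
' file

The fragments are still joined with the default OFS (a single space), which is what turns "cut" and "in two" back into "cut in two" in the output above.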
