What's the most robust way to efficiently parse CSV using awk?


Problem Description

The intent of this question is to provide a canonical answer.

Given a CSV as might be generated by Excel or other tools, with embedded newlines, embedded double quotes, and empty fields, like:

$ cat file.csv
"rec1, fld1",,"rec1"",""fld3.1
"",
fld3.2","rec1
fld4"
"rec2, fld1.1

fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4

What's the most robust and efficient way to use awk to identify the separate records and fields:

Record 1:
    $1=<rec1, fld1>
    $2=<>
    $3=<rec1","fld3.1
",
fld3.2>
    $4=<rec1
fld4>
----
Record 2:
    $1=<rec2, fld1.1

fld1.2>
    $2=<rec2 fld2.1"fld2.2"fld2.3>
    $3=<>
    $4=<rec2 fld4>
----

so they can be used internally as those records and fields by the rest of the awk script.

A valid CSV would be one that conforms to RFC 4180 or can be generated by MS-Excel.

The solution must tolerate the end of record being just LF (\n), as is typical for UNIX files, rather than the CRLF (\r\n) that the standard requires and that Excel or other Windows tools would generate. It will also tolerate unquoted fields mixed with quoted fields. It specifically does not need to tolerate escaping "s with a preceding backslash (i.e. \" instead of "") as some other CSV formats allow - if you have that, then adding a gsub(/\\"/,"\"\"") up front will handle it, whereas trying to handle both escaping mechanisms automatically in one script would make the script unnecessarily fragile and complicated.
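For example, that up-front substitution can be run as a simple preprocessing step before the main parser sees the data (a minimal sketch; the input line is made up for illustration):

```shell
# Convert backslash-escaped quotes (\") into RFC 4180 doubled quotes ("")
printf '%s\n' 'a,"b\"c"' |
awk '{ gsub(/\\"/, "\"\""); print }'
# -> a,"b""c"
```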

Answer

If your CSV cannot contain newlines or escaped double quotes, then all you need is this (using GNU awk for FPAT):

$ echo 'foo,"field,with,commas",bar' |
    awk -v FPAT='[^,]*|"[^"]+"' '{for (i=1; i<=NF;i++) print i, "<" $i ">"}'
1 <foo>
2 <"field,with,commas">
3 <bar>

Otherwise, though, the more general, robust, and portable solution that will work with any modern awk is:

$ cat decsv.awk
function buildRec(      i,orig,fpat,done) {
    $0 = PrevSeg $0
    if ( gsub(/"/,"&") % 2 ) {
        PrevSeg = $0 RS
        done = 0
    }
    else {
        PrevSeg = ""
        gsub(/@/,"@A"); gsub(/""/,"@B")            # <"x@foo""bar"> -> <"x@Afoo@Bbar">
        orig = $0; $0 = ""                         # Save $0 and empty it
        fpat = "([^" FS "]*)|(\"[^\"]+\")"         # Mimic GNU awk FPAT meaning
        while ( (orig!="") && match(orig,fpat) ) { # Find the next string matching fpat
            $(++i) = substr(orig,RSTART,RLENGTH)   # Create a field in new $0
            gsub(/@B/,"\"",$i); gsub(/@A/,"@",$i)  # <"x@Afoo@Bbar"> -> <"x@foo"bar">
            gsub(/^"|"$/,"",$i)                    # <"x@foo"bar">   -> <x@foo"bar>
            orig = substr(orig,RSTART+RLENGTH+1)   # Move past fpat+sep in orig $0
        }
        done = 1
    }
    return done
}

BEGIN { FS=OFS="," }
!buildRec() { next }
{
    printf "Record %d:\n", ++recNr
    for (i=1;i<=NF;i++) {
        # To replace newlines with blanks add gsub(/\n/," ",$i) here
        printf "    $%d=<%s>\n", i, $i
    }
    print "----"
}


$ awk -f decsv.awk file.csv
Record 1:
    $1=<rec1, fld1>
    $2=<>
    $3=<rec1","fld3.1
",
fld3.2>
    $4=<rec1
fld4>
----
Record 2:
    $1=<rec2, fld1.1

fld1.2>
    $2=<rec2 fld2.1"fld2.2"fld2.3>
    $3=<>
    $4=<rec2 fld4>
----

The above assumes UNIX line endings of \n. With Windows \r\n line endings it's much simpler, as the "newlines" within each field will actually just be line feeds (i.e. \ns), so you can set RS="\r\n" (using GNU awk for multi-char RS) and then the \ns within fields will not be treated as line endings.
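A minimal sketch of that CRLF case (assumes an awk that supports a multi-character RS, such as GNU awk; the sample data is made up):

```shell
# With CRLF record endings, a bare \n inside a quoted field is not a
# record separator, so RS="\r\n" keeps the embedded newline in record 1.
printf 'f1,"line1\nline2",f3\r\nx,y,z\r\n' |
awk -v RS='\r\n' '{ printf "Record %d: %s\n", NR, $0 }'
# prints 2 records; the embedded newline survives inside record 1
```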

It works by simply counting how many "s are present so far in the current record whenever it encounters the RS. If that count is odd, then the RS (presumably \n, but it doesn't have to be) is mid-field, so we keep building the current record; if it's even, then it's the end of the current record, and the rest of the script can go on to process the now-complete record.
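The parity check can be watched in isolation: gsub(/"/,"&") replaces every " with itself (a no-op on the text) and returns the number of replacements, i.e. the quote count. On the first physical line of the sample file that count is odd, signalling an unfinished record:

```shell
# 7 quotes on this line -> odd -> the record continues past this newline
printf '%s\n' '"rec1, fld1",,"rec1"",""fld3.1' |
awk '{ print gsub(/"/, "&") % 2 }'
# -> 1
```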

The gsub(/@/,"@A"); gsub(/""/,"@B") converts every pair of double quotes across the whole record (bear in mind these "" pairs can only occur within quoted fields) to a string @B that does not contain a double quote, so that when we split the record into fields the match() doesn't get tripped up by quotes appearing inside fields; literal @s are first hidden as @A so that @B cannot collide with text already in the record. The gsub(/@B/,"\"",$i); gsub(/@A/,"@",$i) restores the quotes inside each field individually and converts the ""s back to the "s they really represent.
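A standalone sketch of that encode/decode round trip, using the <"x@foo""bar"> sample from the script's own comments:

```shell
printf '%s\n' '"x@foo""bar"' |
awk '{
    gsub(/@/, "@A"); gsub(/""/, "@B")   # encode: "x@Afoo@Bbar"
    enc = $0
    gsub(/@B/, "\""); gsub(/@A/, "@")   # decode: "" becomes the " it represents
    print enc
    print $0
}'
# -> "x@Afoo@Bbar"
# -> "x@foo"bar"
```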

Also see How do I use awk under cygwin to print fields from an excel spreadsheet? for how to generate CSVs from Excel spreadsheets.
