What's the most robust way to efficiently parse CSV using awk?


Problem description


The intent of this question is to provide a canonical answer.


Given a CSV as might be generated by Excel or other tools with embedded newlines, embedded double quotes and empty fields like:

$ cat file.csv
"rec1, fld1",,"rec1"",""fld3.1
"",
fld3.2","rec1
fld4"
"rec2, fld1.1

fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4


What's the most robust way to efficiently use awk to identify the separate records and fields:

Record 1:
    $1=<rec1, fld1>
    $2=<>
    $3=<rec1","fld3.1
",
fld3.2>
    $4=<rec1
fld4>
----
Record 2:
    $1=<rec2, fld1.1

fld1.2>
    $2=<rec2 fld2.1"fld2.2"fld2.3>
    $3=<>
    $4=<rec2 fld4>
----


so it can be used as those records and fields internally by the rest of the awk script.


A valid CSV would be one that conforms to RFC 4180 or can be generated by MS-Excel.


The solution must tolerate the end of record just being LF (\n) as is typical for UNIX files rather than CRLF (\r\n) as that standard requires and Excel or other Windows tools would generate. It will also tolerate unquoted fields mixed with quoted fields. It will specifically not need to tolerate escaping "s with a preceding backslash (i.e. \" instead of "") as some other CSV formats allow - if you have that then adding a gsub(/\\"/,"\"\"") up front would handle it, and trying to handle both escaping mechanisms automatically in one script would make the script unnecessarily fragile and complicated.
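If you do have backslash-escaped quotes, that up-front gsub is a one-line pre-pass; here is a minimal sketch (the inline sample data is made up for the illustration):

```shell
# Convert backslash-escaped quotes (\") into RFC 4180 doubled quotes ("")
# before handing the data to the CSV parser proper.
printf 'a,"say \\"hi\\"",c\n' |
awk '{ gsub(/\\"/,"\"\""); print }'
# prints: a,"say ""hi""",c
```

The output is now in the doubled-quote convention that the rest of the answer assumes.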

Answer


If your CSV cannot contain newlines or escaped double quotes then all you need is (with GNU awk for FPAT):

$ echo 'foo,"field,with,commas",bar' |
    awk -v FPAT='[^,]*|"[^"]+"' '{for (i=1; i<=NF;i++) print i, "<" $i ">"}'
1 <foo>
2 <"field,with,commas">
3 <bar>


Otherwise, though, the more general, robust, portable solution that will work with any modern awk is:

$ cat decsv.awk
function buildRec(      i,orig,fpat,done) {
    $0 = PrevSeg $0
    if ( gsub(/"/,"&") % 2 ) {
        PrevSeg = $0 RS
        done = 0
    }
    else {
        PrevSeg = ""
        gsub(/@/,"@A"); gsub(/""/,"@B")            # <"x@foo""bar"> -> <"x@Afoo@Bbar">
        orig = $0; $0 = ""                         # Save $0 and empty it
        fpat = "([^" FS "]*)|(\"[^\"]+\")"         # Mimic GNU awk FPAT meaning
        while ( (orig!="") && match(orig,fpat) ) { # Find the next string matching fpat
            $(++i) = substr(orig,RSTART,RLENGTH)   # Create a field in new $0
        gsub(/@B/,"\"",$i); gsub(/@A/,"@",$i)  # <"x@Afoo@Bbar"> -> <"x@foo"bar">
            gsub(/^"|"$/,"",$i)                    # <"x@foo"bar">   -> <x@foo"bar>
            orig = substr(orig,RSTART+RLENGTH+1)   # Move past fpat+sep in orig $0
        }
        done = 1
    }
    return done
}

BEGIN { FS=OFS="," }
!buildRec() { next }
{
    printf "Record %d:\n", ++recNr
    for (i=1;i<=NF;i++) {
        # To replace newlines with blanks add gsub(/\n/," ",$i) here
        printf "    $%d=<%s>\n", i, $i
    }
    print "----"
}


$ awk -f decsv.awk file.csv
Record 1:
    $1=<rec1, fld1>
    $2=<>
    $3=<rec1","fld3.1
",
fld3.2>
    $4=<rec1
fld4>
----
Record 2:
    $1=<rec2, fld1.1

fld1.2>
    $2=<rec2 fld2.1"fld2.2"fld2.3>
    $3=<>
    $4=<rec2 fld4>
----


The above assumes UNIX line endings of \n. With Windows line endings it's much simpler as the "newlines" within each field will actually just be line feeds (i.e. \ns) and so you can set RS="\r\n" (using GNU awk for multi-char RS) and then the \ns within fields will not be treated as line endings.


It works by simply counting how many "s are present so far in the current record whenever it encounters the RS - if it's an odd number then the RS (presumably \n but it doesn't have to be) is mid-field and so we keep building the current record, but if it's even then it's the end of the current record and so we can continue with the rest of the script processing the now-complete record.
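That odd/even test can be demonstrated in isolation; this is a standalone sketch with made-up inline data, not the full script:

```shell
# Accumulate segments into rec; gsub(/"/,"&",rec) leaves rec unchanged but
# returns the number of quotes, so an odd count means a quoted field is
# still open and the line break we just hit was mid-field.
printf '"open field\nstill open",done\n' |
awk '{
    rec = rec $0
    if ( gsub(/"/,"&",rec) % 2 ) { rec = rec RS; next }   # odd: keep reading
    print "complete record: " rec
    rec = ""
}'
```

The two physical input lines come out as one logical record, with the embedded newline preserved inside it.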


The gsub(/@/,"@A"); gsub(/""/,"@B") converts every pair of double quotes across the whole record (bear in mind these "" pairs can only apply within quoted fields) to a string @B that does not contain a double quote, so that when we split the record into fields the match() doesn't get tripped up by quotes appearing inside fields. The gsub(/@B/,"\"",$i); gsub(/@A/,"@",$i) restores the quotes inside each field individually and also converts the ""s to the "s they really represent.


Also see How do I use awk under cygwin to print fields from an excel spreadsheet? for how to generate CSVs from Excel spreadsheets.
