Awk to get .csv column containing commas and newlines

Question

I have data in a .csv column that sometimes contains commas and newlines. If there is a comma in my data, I have enclosed the entire string with double quotes. How would I go about parsing the output of that column to a .txt file taking the newlines and commas into consideration.

Sample data that doesn't work with my command:

,"This is some text with a , in it.", #data with commas are enclosed in double quotes

,line 1 of data
line 2 of data, #data with a couple of newlines

,"Data that may a have , in it and
also be on a newline as well.",

Here is what I have so far:

awk -F "\"*,\"*" '{print $4}' file.csv > column_output.txt

Recommended answer

$ cat decsv.awk
BEGIN { FPAT = "([^,]*)|(\"[^\"]+\")"; OFS="," }
{
    # create strings that cannot exist in the input to map escaped quotes to
    gsub(/a/,"aA")
    gsub(/\\"/,"aB")
    gsub(/""/,"aC")

    # prepend previous incomplete record segment if any
    $0 = prev $0
    numq = gsub(/"/,"&")
    if ( numq % 2 ) {
        # this is inside double quotes so incomplete record
        prev = $0 RT
        next
    }
    prev = ""

    for (i=1;i<=NF;i++) {
        # map the replacement strings back to their original values
        gsub(/aC/,"\"\"",$i)
        gsub(/aB/,"\\\"",$i)
        gsub(/aA/,"a",$i)
    }

    printf "Record %d:\n", ++recNr
    for (i=0;i<=NF;i++) {
        printf "\t$%d=<%s>\n", i, $i
    }
    print "#######"
}
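
The record-joining step in the script hinges on counting double quotes: an odd count means the text read so far ends inside a quoted field, so the next line belongs to the same record. Below is a minimal sketch of just that step. It assumes records end in plain newlines and so uses a literal "\n" where the script uses gawk's RT; the file names parity_demo.csv and parity_demo.txt are made up for the illustration.

```shell
# Two physical lines that together form one CSV record.
printf '%s\n' ',"line 1 of data' 'line 2 of data", #data with a couple of newlines' > parity_demo.csv

awk '{
    # prepend the saved, still-incomplete text (if any)
    $0 = prev $0
    # replacing each quote with itself changes nothing; gsub just returns the count
    numq = gsub(/"/, "&")
    if (numq % 2) {
        printf "after line %d: %d quote(s), record incomplete\n", NR, numq
        prev = $0 "\n"    # assumption: newline-terminated records (stands in for RT)
        next
    }
    printf "after line %d: %d quote(s), record complete\n", NR, numq
    prev = ""
}' parity_demo.csv > parity_demo.txt

cat parity_demo.txt
```

The first line alone contains one quote, so it is held back; once the second line is appended the count reaches two and the record is treated as complete.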

$ awk -f decsv.awk file
Record 1:
        $0=<,"This is some text with a , in it.", #data with commas are enclosed in double quotes>
        $1=<>
        $2=<"This is some text with a , in it.">
        $3=< #data with commas are enclosed in double quotes>
#######
Record 2:
        $0=<,"line 1 of data
line 2 of data", #data with a couple of newlines>
        $1=<>
        $2=<"line 1 of data
line 2 of data">
        $3=< #data with a couple of newlines>
#######
Record 3:
        $0=<,"Data that may a have , in it and
also be on a newline as well.",>
        $1=<>
        $2=<"Data that may a have , in it and
also be on a newline as well.">
        $3=<>
#######
Record 4:
        $0=<,"Data that \"may\" a have ""quote"" in it and
also be on a newline as well.",>
        $1=<>
        $2=<"Data that \"may\" a have ""quote"" in it and
also be on a newline as well.">
        $3=<>
#######

The above uses GNU awk for FPAT and RT. I don't know of any CSV format that would allow you to have a newline in the middle of a field that's not enclosed by quotes (if it did you'd never know where any record ended) so the script doesn't allow for that. The above was run on this input file:

$ cat file
,"This is some text with a , in it.", #data with commas are enclosed in double quotes
,"line 1 of data
line 2 of data", #data with a couple of newlines
,"Data that may a have , in it and
also be on a newline as well.",
,"Data that \"may\" a have ""quote"" in it and
also be on a newline as well.",
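
The script above prints every field for inspection, whereas the question asks for a single column written to a .txt file. The following is a rough sketch of that last step for awks that lack FPAT and RT. It is a different technique from the answer's (a character-by-character scan instead of FPAT), and it assumes that quotes inside a field are escaped only by doubling them, so backslash-escaped quotes are not handled in general. The script name getcol.awk and the variable col are hypothetical.

```shell
# Recreate the first three records of the sample input shown above.
cat > file.csv <<'EOF'
,"This is some text with a , in it.", #data with commas are enclosed in double quotes
,"line 1 of data
line 2 of data", #data with a couple of newlines
,"Data that may a have , in it and
also be on a newline as well.",
EOF

# getcol.awk (hypothetical name): print field number col of each CSV record.
cat > getcol.awk <<'EOF'
{
    # Join physical lines until the number of double quotes is even.
    if (pending) rec = rec "\n" $0
    else rec = $0
    tmp = rec
    pending = gsub(/"/, "", tmp) % 2
    if (pending) next

    # Scan the record one character at a time, splitting only on
    # commas that are outside double quotes.
    n = 0; field = ""; inq = 0
    len = length(rec)
    for (i = 1; i <= len; i++) {
        c = substr(rec, i, 1)
        if (c == "\"") inq = !inq
        if (c == "," && !inq) { f[++n] = field; field = "" }
        else field = field c
    }
    f[++n] = field

    if (col <= n) print f[col]
}
EOF

awk -v col=2 -f getcol.awk file.csv > column_output.txt
cat column_output.txt
```

As in the answer's output, each field keeps its surrounding double quotes and any embedded newline, so the three records here produce five lines in column_output.txt. With the GNU awk script above, the analogous change would be to print $2 in place of the debug loop.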
