Using awk or perl to extract specific columns from CSV (parsing)


Question

Background - I want to extract specific columns from a csv file. The csv file is comma delimited, uses double quotes as the text qualifier (optional, but always present when a field contains special characters - see example), and uses backslashes as the escape character. Some fields may also be blank.

Example Input and Desired Output - For example, I only want columns 1, 3, and 4 to be in the output file. The extracted columns should match the format of the original file exactly: no escape characters removed, no extra quotes added, and so on.

Input

"John \"Super\" Doe",25,"123 ABC Street",123-456-7890,"M",A
"Jane, Mary","",132 CBS Street,333-111-5332,"F",B
"Smith \"Jr.\", Jane",35,,555-876-1233,"F",
"Lee, Jack",22,123 Sesame St,"","M",D

Desired Output

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary",132 CBS Street,333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack",123 Sesame St,""


Preliminary Script (awk) - The following is a preliminary script I found that works for the most part, but fails in one particular case that I noticed, and possibly in others that I have not seen or thought of yet:

#!/usr/xpg4/bin/awk -f

BEGIN{  OFS = FS = ","  }

/"/{
    for(i=1;i<=NF;i++){
        if($i ~ /^"[^"]+$/){
            for(x=i+1;x<=NF;x++){
                $i=$i","$x
                if($i ~ /"+$/){
                    z = x - (i + 1) + 1
                    for(y=i+1;y<=NF;y++)
                        $y = $(y + z)
                    break
                }
            }
            NF = NF - z
            i=x
        }
    }
print $1,$3,$4
}

The above seems to work well until it comes across a field that contains both escaped double quotes and a comma. In that case, the parsing is thrown off and the output is incorrect.
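To make that failure concrete, here is a small sketch (not part of the original post) that splits each line on every comma, as the script does with FS = ",", and then applies the same test the script uses to recognise the start of a quoted field that was cut in two. The function name naive_split is made up for illustration.

```shell
# naive_split: hypothetical helper, for illustration only.
# Prints the naive field count and whether field 1 passes the test
# the script uses to detect a quoted field that was split on a comma.
naive_split() {
    awk -F, '{
        # Same pattern as in the script: a leading quote followed only
        # by non-quote characters up to the end of the field
        opens = ($1 ~ /^"[^"]+$/) ? "yes" : "no"
        print NF, opens
    }'
}
```

Fed the "Jane, Mary" line, this prints "7 yes": the line is cut into seven pieces, but the first piece is recognised and the script can glue it back together. Fed the "Smith \"Jr.\", Jane" line, it prints "7 no": the first piece already contains other quotes (the escaped ones), so the pattern does not match, the pieces are never rejoined, and the printed columns are shifted.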

Question/Comments - I have read that awk is not the best option for parsing csv files, and that perl is suggested instead. However, I do not know perl at all. I have found some example perl scripts, but they do not give the output I am looking for, and I do not know how to adapt them to what I want.

As for awk, I am familiar with it and use its basic functionality occasionally, but I do not know much of the advanced functionality, such as some of the commands used in the script above. Is my desired output possible using awk alone? If so, would it be possible to edit the script above to fix the issue I am having with it? Could someone explain line by line what exactly the script is doing?

Any help would be appreciated, thanks!

Recommended Answer

I wouldn't reinvent the wheel.

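use strict;
use warnings;
use Text::CSV_XS;

# Backslash is the escape character; eol is appended to every row printed
my $csv = Text::CSV_XS->new({
   binary      => 1,
   escape_char => '\\',
   eol         => "\n",
});

my $fh_in  = \*STDIN;
my $fh_out = \*STDOUT;

# Indices 0, 2 and 3 are columns 1, 3 and 4
while (my $row = $csv->getline($fh_in)) {
   $csv->print($fh_out, [ @{$row}[0,2,3] ])
      or die("".$csv->error_diag());
}

# getline also returns false on a parse error, so check we really hit EOF
$csv->eof()
   or die("".$csv->error_diag());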

Output:

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary","132 CBS Street",333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack","123 Sesame St",

It adds quotes around addresses that didn't have any already, but since some addresses already have quotes around them, you obviously can handle that. It also writes the empty quoted field ("") in the last row as a bare empty field.

Reinventing the wheel:

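use strict;
use warnings;

# One field: a quoted string with backslash escapes, or an unquoted run
my $field = qr/"(?:[^"\\]|\\.)*"|[^"\\,]*/s;
while (<>) {
   # Capture columns 1, 3 and 4; column 2 is matched but not captured.
   # The trailing comma means a fifth column must follow.
   my @fields = /^($field),$field,($field),($field),/
      or die("Cannot parse: $_");
   print(join(',', @fields), "\n");
}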

Output:

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary",132 CBS Street,333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack",123 Sesame St,""

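As for whether awk alone can do this: the following is a minimal sketch, not taken from the answer above, that walks each line one character at a time instead of splitting on commas. The function name extract_cols is made up for illustration. It assumes that no field contains an embedded newline and that a backslash escapes the next character wherever it appears.

```shell
# extract_cols: hypothetical helper, for illustration only.
# Reads CSV on stdin and prints columns 1, 3 and 4 exactly as written.
extract_cols() {
    awk '{
        split("", f)                 # clear fields left over from the previous line
        n = 0; field = ""; in_quotes = 0
        len = length($0)
        for (i = 1; i <= len; i++) {
            c = substr($0, i, 1)
            if (c == "\\" && i < len) {
                # keep the backslash and the character it escapes verbatim
                field = field c substr($0, i + 1, 1)
                i++
            } else if (c == "\"") {
                # an unescaped quote opens or closes a quoted field
                in_quotes = !in_quotes
                field = field c
            } else if (c == "," && !in_quotes) {
                # a comma outside quotes ends the field
                f[++n] = field
                field = ""
            } else {
                field = field c
            }
        }
        f[++n] = field               # last field has no trailing comma
        print f[1] "," f[3] "," f[4]
    }'
}
```

Usage would look like: extract_cols < input.csv > output.csv (both file names are placeholders). Because the fields are copied character for character, quotes and escapes come out exactly as they went in, which is what the desired output asks for.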