Read CSV file with embedded newlines


Question

I'm working on a file that I scraped from a website. It is saved as a semicolon-delimited CSV with quoted fields, and the last field contains embedded newlines. I've been working on a script to process the file. I'm fairly new to Perl; at first I tried a plain Perl script, but quickly found out that wasn't working. I did my research and found that I should use the Text::CSV module instead. I came across these sites, which explain how to use the module:

http://perlmaven.com/how-to-read-a-csv-file-using-perl

http://perlmeme.org/tutorials/parsing_csv.html

http://metacpan.org/pod/Text::CSV#Embedded-newlines

Basically, what I'm trying to accomplish is to read the file correctly, so that all the fields are delimited properly instead of breaking off at a newline, then remove the newlines from the last field and write the result to a new file.

Here is an example of the original data:

 "2030";"NH Amersfoort";"Stationsstraat 75";"3811 MH AMERSFOORT";"033-4221200";"www.nh-hotels.nl";"52.154316";"5.380036";"<UL class=stars><LI>
 <LI>
 <LI>
 <LI></LI></UL>"
 "2031";"NH Amsterdam Centre";"Stadhouderskade 7";"1054 ES AMSTERDAM";"020-6851351";"www.nh-hotels.com";"52.363075";"4.879458";"<UL class=stars><LI>
 <LI>
 <LI>
 <LI></LI></UL>"
 "2032";"NH Atlanta Rotterdam Hotel";"Aert van Nesstraat 4";"3012 CA ROTTERDAM";"010-2067800";"www.nh-hotels.com";"51.921028";"4.478619";"<UL class=stars><LI>
 <LI>
 <LI>
 <LI></LI></UL>" 

What I want is:

 "2030";"NH Amersfoort";"Stationsstraat 75";"3811 MH AMERSFOORT";"033-4221200";"www.nh-hotels.nl";"52.154316";"5.380036";"<UL class=stars><LI><LI><LI><LI></LI></UL>"
 "2031";"NH Amsterdam Centre";"Stadhouderskade 7";"1054 ES AMSTERDAM";"020-6851351";"www.nh-hotels.com";"52.363075";"4.879458";"<UL class=stars><LI><LI><LI><LI></LI></UL>"
 "2032";"NH Atlanta Rotterdam Hotel";"Aert van Nesstraat 4";"3012 CA ROTTERDAM";"010-2067800";"www.nh-hotels.com";"51.921028";"4.478619";"<UL class=stars><LI><LI><LI><LI></LI></UL>" 

This is my full script so far. I have tried 10 different options and suggestions, and none of them work!

 use strict;
 use warnings;    
 use Text::CSV;

 my $inputfile  = shift || die "Give input and output names!\n";
 my $outputfile = shift || die "Give output name!\n";

 open my $infile,  '<', $inputfile   or die "Sourcefile in use / not found :$!\n";
 open my $outfile, '>', $outputfile  or die "Outputfile in use :$!\n";

 my $csv = Text::CSV->new ({
     binary   => 1,
     sep_char => ';'
 });

 while (my $elements = $csv->getline( $infile )) {
     my $stars = $elements->[8];
     #$ster =~ s/[\r\n]//g
     print "$stars\n\n";
 }

 close $infile;
 close $outfile;

This prints the field correctly, newlines included, but of course it hasn't removed them. How do I do that? Using a regex to substitute the newlines is not working. And the next question, once I figure out how to clean up that field: how do I print the new file?

Recommended answer

I'm not sure what you are asking here, because it seems you already have your answers. However, this code does work:

use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new ({
    binary => 1,
    sep_char => ';',
    eol => $/,                # to make $csv->print use newlines
    always_quote => 1,        # to keep your numbers quoted
});

while (my $row = $csv->getline( *DATA )) {
    $row->[8] =~ s/[\r\n]+//g;
    $csv->print(*STDOUT, $row);
}

__DATA__
"2030";"NH Amersfoort";"Stationsstraat 75";"3811 MH AMERSFOORT";"033-4221200";"www.nh-hotels.nl";"52.154316";"5.380036";"<UL class=stars><LI>
<LI>
<LI>
<LI></LI></UL>"
"2031";"NH Amsterdam Centre";"Stadhouderskade 7";"1054 ES AMSTERDAM";"020-6851351";"www.nh-hotels.com";"52.363075";"4.879458";"<UL class=stars><LI>
<LI>
<LI>
<LI></LI></UL>"
"2032";"NH Atlanta Rotterdam Hotel";"Aert van Nesstraat 4";"3012 CA ROTTERDAM";"010-2067800";"www.nh-hotels.com";"51.921028";"4.478619";"<UL class=stars><LI>
<LI>
<LI>
<LI></LI></UL>"

Pointers:

Using the eol option with Text::CSV's print makes it do what you expect, which is to print a newline after each row. I used STDOUT as the output handle, but you can use any file handle you want.
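To answer the "how do I print the new file" part more directly, here is a minimal sketch that writes to a lexical file handle instead of STDOUT. The sub name clean_csv and the variable names are my own, not from the question, and it assumes the newlines are always in the ninth field (index 8), as in the sample data.

```perl
use strict;
use warnings;
use Text::CSV;

# clean_csv is a hypothetical helper name, used here for illustration only.
sub clean_csv {
    my ($inputfile, $outputfile) = @_;

    my $csv = Text::CSV->new({
        binary       => 1,     # allow embedded newlines inside quoted fields
        sep_char     => ';',
        eol          => "\n",  # line ending that $csv->print appends to each row
        always_quote => 1,     # keep every field quoted, as in the input
    }) or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

    open my $in,  '<', $inputfile  or die "Cannot read $inputfile: $!\n";
    open my $out, '>', $outputfile or die "Cannot write $outputfile: $!\n";

    while (my $row = $csv->getline($in)) {
        $row->[8] =~ s/[\r\n]+//g;   # strip the newlines from the field in place
        $csv->print($out, $row);     # write the cleaned row to the output file
    }

    close $in;
    close $out or die "Cannot close $outputfile: $!\n";
}

# Run only when two file names are given on the command line.
clean_csv(@ARGV) if @ARGV == 2;
```

Saved as, say, clean.pl, it would be run as "perl clean.pl input.csv output.csv".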

I don't know why you say substitution does "not work" for you, but I suspect that perhaps you did something like this:

my $foo = $row->[8];
$foo =~ s/[\r\n]//g;
print @$row;

This does not change the values in $row, just the copy in $foo.
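The difference can be seen without Text::CSV at all. The sketch below uses a hand-made two-element array reference as a stand-in for a parsed row (the data is invented for illustration) and compares substituting on a copy with substituting on the element itself.

```perl
use strict;
use warnings;

# A stand-in for one parsed row; real rows would come from $csv->getline.
my $row = [ "2030", "<UL class=stars><LI>\n<LI></LI></UL>" ];

# Substituting on a copy changes only the copy.
my $copy = $row->[1];
$copy =~ s/[\r\n]+//g;
print "row still contains a newline\n" if $row->[1] =~ /\n/;

# Substituting on the element itself changes the row.
$row->[1] =~ s/[\r\n]+//g;
print "row is clean now\n" unless $row->[1] =~ /\n/;

# To clean every field, loop over the row: $_ is an alias for each element,
# so the substitution modifies the array in place.
s/[\r\n]+//g for @$row;
```

If you prefer working with a named variable, assign the cleaned value back afterwards, for example $row->[8] = $foo.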
