Fastest CSV Parser in Perl


Problem description

I am creating a subroutine that:

(1) Parses a CSV file;

(2) Checks that all the rows in that file have the expected number of columns. It croaks if the number of columns is invalid.

When the number of rows ranges from thousands to millions, what do you think is the most efficient way to do it?

Right now, I'm trying out these implementations.

(1) Basic file parser

open my $in_fh, '<', $file or 
    croak "Cannot open '$file': $OS_ERROR";                                                            

my $row_no = 0;                                                                                           
while ( my $row = <$in_fh> ) {                                                                            
    my @values = split (q{,}, $row);                                                                      
    ++$row_no;                                                                                            
    if ( scalar @values < $min_cols_no ) {                                                                
        croak "Invalid file format. File '$file' does not have '$min_cols_no' columns in line '$row_no'.";
    }                                                                                                     
}                                                                                                         

close $in_fh                                                                                              
    or croak "Cannot close '$file': $OS_ERROR";                                                           

(2) Using Text::CSV_XS (bind_columns and csv->getline)

my $csv = Text::CSV_XS->new() or
    croak "Cannot use CSV: " . Text::CSV_XS->error_diag();
open my $in_fh, '<', $file or
    croak "Cannot open '$file': $OS_ERROR";

my $row_no = 1;
my @cols = @{$csv->getline($in_fh)};
my $row = {};
$csv->bind_columns(\@{$row}{@cols});
while ($csv->getline($in_fh)) {
    ++$row_no;
    if ( scalar keys %$row < $min_cols_no ) {
        croak "Invalid file format. File '$file' does not have '$min_cols_no' columns in line '$row_no'.";
    }
}

$csv->eof or $csv->error_diag();
close $in_fh or
    croak "Cannot close '$file': $OS_ERROR";

(3) Using Text::CSV_XS (csv->parse)

my $csv = Text::CSV_XS->new() or
    croak "Cannot use CSV: " . Text::CSV_XS->error_diag();
open my $in_fh, '<', $file or
    croak "Cannot open '$file': $OS_ERROR";

my $row_no = 0;
while ( <$in_fh> ) {
    $csv->parse($_);
    ++$row_no;
    if ( scalar $csv->fields < $min_cols_no ) {
        croak "Invalid file format. File '$file' does not have '$min_cols_no' columns in line '$row_no'.";
    }
}

$csv->eof or $csv->error_diag();                                                                         
close $in_fh or 
    croak "Cannot close '$file': $OS_ERROR";                                                          

(4) Using Parse::CSV

use Parse::CSV;                                                                                           
my $simple = Parse::CSV->new(                                                                             
    file => $file                                                                                         
);                                                                                                        

my $row_no = 0;                                                                                           
while ( my $array_ref = $simple->fetch ) {                                                                
    ++$row_no;                                                                                            
    if ( scalar @$array_ref < $min_cols_no ) {                                                            
        croak "Invalid file format. File '$file' does not have '$min_cols_no' columns in line '$row_no'.";
    }                                                                                                     
}                                                                                                         

I benchmarked them using the Benchmark module.

use Benchmark qw(timeit timestr timediff :hireswallclock);
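
The post does not show the timing harness itself, so here is a minimal, self-contained sketch of how one implementation could have been wrapped and passed to timeit; the check_cols_split name, the test file name, and the column count are placeholders rather than the code that was actually benchmarked:

use strict;
use warnings;
use Carp qw(croak);
use English qw(-no_match_vars);              # provides $OS_ERROR
use Benchmark qw(timeit timestr :hireswallclock);

# Hypothetical wrapper around implementation 1 so it can be handed to timeit.
sub check_cols_split {
    my ($file, $min_cols_no) = @_;
    open my $in_fh, '<', $file or croak "Cannot open '$file': $OS_ERROR";
    my $row_no = 0;
    while ( my $row = <$in_fh> ) {
        my @values = split q{,}, $row;
        ++$row_no;
        croak "Only " . scalar @values . " columns in line $row_no"
            if @values < $min_cols_no;
    }
    close $in_fh or croak "Cannot close '$file': $OS_ERROR";
    return;
}

# Time a single pass over a test file and print the high-resolution wall-clock result.
my $t = timeit( 1, sub { check_cols_split( 'test_1500000.csv', 10 ) } );
print 'Implementation 1: ', timestr($t), "\n";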

And these are the numbers (in seconds) that I got:

Lines in file      Impl. 1    Impl. 2    Impl. 3    Impl. 4
1,000              0.0016     0.0025     0.0050     0.0097
10,000             0.0204     0.0244     0.0523     0.1050
1,500,000          1.8697     3.1913     7.8475     15.6274

Given these numbers, I would conclude that the simple parser is the fastest, but from what I have read from different sources, Text::CSV_XS should be the fastest.

Will someone enlighten me on this? Is there something wrong with how I used the modules? Thanks a lot for your help!

Solution

Note that your Text::CSV_XS version does more than your simple parser version. It splits the line, puts it into memory, and makes your hashref point to the fields.

It also may have other logic under the hood, like allowing escaped delimiters (I don't know, as I haven't used it). On top of that, there is always a small amount of overhead when using a module: function calls, passing parameters back and forth, and perhaps generic code that doesn't really apply in your case (such as error checking for things you don't care about).
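
As a concrete illustration of that extra logic (the sample record below is invented for illustration, not taken from the question): a plain split on commas miscounts fields as soon as a quoted value contains a comma, while Text::CSV_XS parses it correctly:

use strict;
use warnings;
use Text::CSV_XS;

# Illustrative record: the second field holds a comma inside quotes.
my $line = 'id,"Smith, John",42';

my @naive = split /,/, $line;                  # 4 pieces: the quoted comma gets split too
my $csv   = Text::CSV_XS->new({ binary => 1 });
$csv->parse($line) or die "parse failed";
my @real  = $csv->fields;                      # 3 fields: id / Smith, John / 42

printf "split: %d fields, Text::CSV_XS: %d fields\n", scalar @naive, scalar @real;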

Normally the benefits of using a module greatly outweigh the costs. You get more features, more reliable code, etc. But that might not be true with a small, very simple task. If all you need to do is verify the number of columns, using a module might be overkill. You could make your own implementation even faster by just counting the number of columns, and not bothering to split at all:

my $sep_count = $min_cols_no - 1;   # Perl does not evaluate arithmetic inside a {...} quantifier
/(?:,[^,]*){$sep_count}/ or croak "Did not find minimum number of columns";
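
Dropped into the same kind of read loop as the basic parser, that check might look like the sketch below; it uses the same $file and $min_cols_no variables as the question's snippets and, like the split version, assumes fields never contain quoted or escaped commas:

open my $in_fh, '<', $file or
    croak "Cannot open '$file': $OS_ERROR";

my $sep_count = $min_cols_no - 1;   # a row with N columns has N-1 separators
my $row_no    = 0;

while ( my $row = <$in_fh> ) {
    ++$row_no;
    # Succeeds as soon as at least $sep_count commas are seen; no field list is built.
    $row =~ /(?:,[^,]*){$sep_count}/
        or croak "Invalid file format. File '$file' does not have '$min_cols_no' columns in line '$row_no'.";
}

close $in_fh or
    croak "Cannot close '$file': $OS_ERROR";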

If you are going to do real processing in addition to this verification step, using the module will probably be beneficial.
