How do I efficiently parse a CSV file in Perl?

Question

I'm working on a project that involves parsing a large CSV-formatted file in Perl, and I'm looking to make things more efficient.

My approach has been to split() the file by lines first, and then split() each line again by commas to get the fields. But this is suboptimal, since at least two passes over the data are required (once to split by lines, then once again for each line). This is a very large file, so cutting the processing in half would be a significant improvement to the entire application.

My question is: what is the most time-efficient means of parsing a large CSV file using only built-in tools?

Note: Each line has a varying number of tokens, so we can't just ignore lines and split by commas only. We can also assume that fields contain only alphanumeric ASCII data (no special characters or other tricks). I also don't want to get into parallel processing, although it might work effectively.

Edit

It can only involve built-in tools that ship with Perl 5.8. For bureaucratic reasons, I cannot use any third-party modules (even if hosted on CPAN).

Another edit

Let's assume that our solution is only allowed to deal with the file data once it is entirely loaded into memory.

Yet another edit

I just grasped how stupid this question is. Sorry for wasting your time. Voting to close.

Recommended answer

The right way to do it, by an order of magnitude, is to use Text::CSV_XS. It will be much faster and much more robust than anything you're likely to write on your own. If you're determined to use only core functionality, you have a couple of options, depending on how you trade speed against robustness.

About the fastest you'll get in pure Perl is to read the file line by line and then naively split the data:

my $file = 'somefile.csv';
my @data;    # one array reference per row
open(my $fh, '<', $file) or die "Can't read file '$file' [$!]\n";
while (my $line = <$fh>) {
    chomp $line;
    my @fields = split(/,/, $line);
    # Store a reference so each row keeps its own fields
    push @data, \@fields;
}
close($fh);
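Since the question says the data is already in memory, here is a minimal sketch of the same loop reading from a string instead of a file. It assumes Perl 5.8 or later with the default PerlIO layer, which allows opening a filehandle on a scalar reference. The helper name parse_csv_naive and the sample text are hypothetical, chosen only for illustration. The sketch also passes a limit of -1 to split, because without it split silently drops trailing empty fields.

```perl
use strict;
use warnings;

# Hypothetical helper: parse CSV text already held in memory into an
# array of rows, each row an array reference of fields.
sub parse_csv_naive {
    my ($text) = @_;
    my @rows;
    # Open a read-only filehandle on the scalar (in-memory file).
    open(my $fh, '<', \$text) or die "Can't open in-memory handle [$!]\n";
    while (my $line = <$fh>) {
        chomp $line;
        # A limit of -1 keeps trailing empty fields that split would drop.
        push @rows, [ split(/,/, $line, -1) ];
    }
    close($fh);
    return \@rows;
}

# Hypothetical sample data with a varying number of fields per line.
my $rows = parse_csv_naive("a,b,c\n1,2\nx,,\n");
print scalar(@$rows), " rows\n";            # number of lines parsed
print join('|', @{ $rows->[2] }), "\n";     # fields of the third line
```

This still makes a single pass over the data: each line is read and split once, and the row structure is preserved even though the lines have different lengths.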

This will fail if any fields contain embedded commas. A more robust (but slower) approach is to use Text::ParseWords, a core module that ships with Perl. To do that, load the module with use Text::ParseWords; and replace the split with this:

    my @fields = Text::ParseWords::parse_line(',', 0, $line);
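As a small illustration of the difference, the sketch below runs both approaches on one hypothetical line whose second field is quoted and contains a comma. The sample line is an assumption made up for this example. The second argument to parse_line is the keep flag, and passing 0 strips the surrounding quotes from the returned fields.

```perl
use strict;
use warnings;
use Text::ParseWords ();   # core module; must be loaded before calling parse_line

# Hypothetical sample line with a quoted field containing a comma.
my $line = 'alpha,"beta,gamma",delta';

# Naive split breaks the quoted field in two.
my @naive = split(/,/, $line);

# parse_line honours the quotes and, with keep = 0, removes them.
my @quoted = Text::ParseWords::parse_line(',', 0, $line);

print scalar(@naive), "\n";    # 4 pieces from the naive split
print scalar(@quoted), "\n";   # 3 fields from parse_line
print $quoted[1], "\n";        # beta,gamma
```

For the data described in the question (alphanumeric ASCII only, no quoting), the plain split is sufficient and avoids the extra cost of parse_line.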
