Perl Program to efficiently process 500,000 small files in a directory


Question

I am processing a large directory every night. It accumulates around 1 million files each night, half of which are .txt files that I need to move to a different directory according to their contents.

Each .txt file is pipe-delimited and contains only 20 records. Record 6 is the one that contains the information I need to determine which directory to move the file to.

Sample record:

A|CHNL_ID|4

In this case the file would be moved to /out/4.
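As a small illustration of that mapping (the record below is the sample from the question; the three-field layout is an assumption based on it), the channel number is simply the third pipe-delimited field:

```perl
use strict;
use warnings;

# Sample record 6 from the question
my $record = 'A|CHNL_ID|4';

# Split on the pipe delimiter; the third field is the channel number
my (undef, undef, $channel) = split /\|/, $record;

# The destination directory would then be "/out/$channel"
print "would move file to /out/$channel\n";
```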

This script is processing at a rate of 80,000 files per hour.

Are there any recommendations on how I could speed this up?

use File::Copy qw(move);

opendir(DIR, $dir) or die "$!\n";
while ( defined( my $txtFile = readdir DIR ) ) {
    next if ( $txtFile !~ /\.txt$/ );
    $cnt++;

    local $/;                                   # slurp mode: read the whole file at once
    open my $fh, '<', "$dir/$txtFile" or die "$!\n";
    my $data = <$fh>;
    my ($channel) = $data =~ /A\|CHNL_ID\|(\d+)/i;
    close($fh);

    move( "$dir/$txtFile", "$outDir/$channel" ) or die "$!\n";
}
closedir(DIR);


Answer

Try something like this:

use File::Copy qw(move);

print localtime() . "\n";                       # to find where time is spent
opendir(DIR, $dir) or die "$!\n";
my @txtFiles = map "$dir/$_", grep /\.txt$/, readdir DIR;
closedir(DIR);

print localtime() . "\n";
my %fileGroup;
for my $txtFile (@txtFiles) {
    # local $/ = "\n";                          # \n or other record separator
    open my $fh, '<', $txtFile or die $!;
    # scalar <$fh> reads one line per iteration; a bare <$fh> inside map
    # would be in list context and slurp the rest of the file
    local $_ = join "", map { scalar <$fh> } 1 .. 6;   # read 6 records, not the whole file
    close($fh);
    push @{ $fileGroup{$1} }, $txtFile
      if /A\|CHNL_ID\|(\d+)/i or die "No channel found in $_";
}

for my $channel (sort keys %fileGroup) {
    moveGroup( @{ $fileGroup{$channel} }, "$outDir/$channel" );
}
print localtime() . " finito\n";

sub moveGroup {
    my $dir = pop @_;
    print localtime() . " <- start $dir\n";
    move( $_, $dir ) for @_;    # or something else if each move spawns a sub-process
    # rename( $_, $dir ) for @_;
}

This splits the job into three main parts where you can time each part to find where most time is spent.
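A further variant worth timing (a sketch, not part of the answer above; `find_channel` is a hypothetical helper name): stop reading each file as soon as the channel record is found, instead of always pulling a fixed number of records.

```perl
use strict;
use warnings;

# Hypothetical helper: scan a file line by line and return the channel
# number from the first matching record, stopping as early as possible.
sub find_channel {
    my ($path) = @_;
    open my $fh, '<', $path or die "$path: $!";
    while ( my $line = <$fh> ) {
        if ( $line =~ /A\|CHNL_ID\|(\d+)/i ) {
            close $fh;
            return $1;
        }
    }
    close $fh;
    return undef;               # no channel record found
}
```

Since the channel record is record 6 of 20, this reads at most 6 lines per file in the normal case, and degrades gracefully if the record appears earlier or later than expected.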

