Parsing unsorted data from large fixed-width text


Question

I am mostly a Matlab user and a Perl n00b. This is my first Perl script.

I have a large fixed-width data file that I would like to process into a binary file with a table of contents. My issue is that the data files are pretty large and the rows are sorted by time rather than grouped by parameter, which makes them difficult (at least for me) to parse in Matlab. Since Matlab is not that good at parsing text, I thought I would try Perl. I wrote the following code, which works ... at least on my small test file. However, it was painfully slow when I tried it on an actual large data file. It was pieced together from lots of examples for various tasks from the web and the Perl documentation.

Here is a small sample of the data file. Note: the real file has about 2000 parameters and is 1-2 GB. Parameters can be text, doubles, or unsigned integers.

Param 1   filter = ALL_VALUES
Param 2   filter = ALL_VALUES
Param 3   filter = ALL_VALUES

Time                     Name     Ty  Value                   
---------- ---------------------- --- ------------
1.1        Param 1                UI  5           
2.23       Param 3                TXT Some Text 1 
3.2        Param 1                UI  10          
4.5        Param 2                D   2.1234     
5.3        Param 1                UI  15         
6.121      Param 2                D   3.1234     
7.56       Param 3                TXT Some Text 2 

The basic logic of my script is to:

  1. Read until the ---- line to build the list of parameters to extract (each one always has "filter =").
  2. Use the ---- line to determine the field widths. The fields are separated by spaces.
  3. For each parameter, build the time and data arrays (a while loop nested inside the foreach over parameters).
  4. In the continue block, write the time and data to the binary file. Then record the name, type, and offsets in a text table of contents file (used later to read the file into Matlab).
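
As a minimal sketch of step 2, the snippet below derives an unpack template from a separator line and applies it to one row. The separator and row are rebuilt in code to match the sample above (widths 10, 22, 3, and 12 are read off that sample); the variable names are only illustrative.

```perl
use strict;
use warnings;

# Separator line as in the sample: runs of dashes separated by single spaces.
my $dashes = join ' ', '-' x 10, '-' x 22, '-' x 3, '-' x 12;

# One data row, padded with sprintf so the column widths are exact.
my $row = sprintf '%-11s%-23s%-4s%s', '2.23', 'Param 3', 'TXT', 'Some Text 1 ';

# Each run of dashes plus the spaces after it becomes one 'A<width>' field.
my @template = map { 'A' . length($_) } $dashes =~ /(\S+\s*)/g;
$template[-1] = 'A*';    # the last field takes the rest of the line

# 'A' strips trailing spaces, so the fields come back already trimmed.
my ($time, $name, $type, $value) = unpack "@template", $row;
print join('|', $time, $name, $type, $value), "\n";
# prints: 2.23|Param 3|TXT|Some Text 1
```

Because the template is built from the separator line rather than hard-coded, parameter names that contain spaces stay in one field.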

Here is my script:

#!/usr/bin/perl

$lineArg1 = @ARGV[0];
open(INFILE, $lineArg1);
open BINOUT, '>:raw', $lineArg1.".bin";
open TOCOUT, '>', $lineArg1.".toc";

my $line;
my $data_start_pos;
my @param_name;
my @template;
while ($line = <INFILE>) {
    chomp $line;
    if ($line =~ s/\s+filter = ALL_VALUES//) {
       $line =~ s/^\s+//;
       $line =~ s/\s+$//;
       push @param_name, $line;
    }
    elsif ($line =~ /^------/) {
        @template = map {'A'.length} $line =~ /(\S+\s*)/g;
        $template[-1] = 'A*';        
        $data_start_pos = tell INFILE;
        last; #Reached start of data exit loop
    }
}
my $template = "@template";
my @lineData;
my @param_data;
my @param_time;
my $data_type;
foreach $current_param (@param_name) {
    @param_time = ();
    @param_data = ();    
    seek(INFILE,$data_start_pos,0); #Jump to data start
    while ($line = <INFILE>) {
        if($line =~ /$current_param/) {      
           chomp($line);
           @lineData = unpack $template, $line;
           push @param_time, @lineData[0];   
           push @param_data, @lineData[3];
        }       
    } # END WHILE <INFILE>
} #END FOR EACH NAME
continue {
        $data_type = @lineData[2];
        print TOCOUT $current_param.",".$data_type.",".tell(BINOUT).","; #Write name,type,offset to start time        
        print BINOUT pack('d*', @param_time);  #Write TimeStamps
        print TOCOUT tell(BINOUT).","; #offset to end of time/data start
        if ($data_type eq "TXT") {
            print BINOUT pack 'A*', join("\n",@param_data);
        }
        elsif ($data_type eq "D") {
            print BINOUT pack('d*', @param_data);
        }
        elsif ($data_type eq "UI") {
            print BINOUT pack('L*', @param_data);
        }        
        print TOCOUT tell(BINOUT).","."\n"; #Write memory loc to end data
}
close(INFILE);
close(BINOUT);
close(TOCOUT);

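So my questions to you good people of the web are as follows:

  1. What did I blatantly screw up? Syntax, declaring variables when I don't need to, etc.
  2. This is probably slow (I am guessing) because of the nested loops and searching line by line over and over again. Is there a better way to restructure the loops to extract multiple lines at once?
  3. Are there any other speed improvement tips you can give?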

I modified the example text file to illustrate that time stamps may be non-integer and that parameter names may contain spaces.

Recommended answer

I modified my code to build a hash as suggested. I have not incorporated the output to binary yet due to time limitations. Plus I need to figure out how to reference the hash to get the data out and pack it into binary. I don't think that part should be too difficult ... hopefully.

On an actual data file (~350 MB and 2.0 million lines) the following code takes approximately 3 minutes to build the hash. CPU usage was 100% on one of my cores (nil on the other three) and Perl memory usage topped out at around 325 MB ... until it dumped millions of lines to the prompt. However, the print Dumper call will be replaced with a binary pack.

Please let me know if I am making any rookie mistakes.

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

my $lineArg1 = $ARGV[0];
open(INFILE, '<', $lineArg1) or die "Cannot open $lineArg1: $!"; #Fail loudly if the input is missing

my $line;
my @param_names;
my @template;
while ($line = <INFILE>) {
    chomp $line; #Remove New Line
    if ($line =~ s/\s+filter = ALL_VALUES//) { #Find parameters and build a list
       push @param_names, trim($line);
    }
    elsif ($line =~ /^----/) {
        @template = map {'A'.length} $line =~ /(\S+\s*)/g; #Make template for unpack
        $template[-1] = 'A*';
        my $data_start_pos = tell INFILE;
        last; #Reached start of data exit loop
    }
}

my $size = $#param_names+1;
my @getType = ((1) x $size);
my $template = "@template";
my @lineData;
my %dataHash;
my $lineCount = 0;
while ($line = <INFILE>) {
    if ($lineCount % 100000 == 0){
        print "On Line: ".$lineCount."\n";
    }
    if ($line =~ /^\d/) { 
        chomp($line);
        @lineData = unpack $template, $line;
        my ($inHeader, $headerIndex) = findStr($lineData[1], @param_names);
        if ($inHeader) { 
            push @{$dataHash{$lineData[1]}{time} }, $lineData[0];
            push @{$dataHash{$lineData[1]}{data} }, $lineData[3];
            if ($getType[$headerIndex]){ # Things that only need written once
                $dataHash{$lineData[1]}{type}  = $lineData[2];
                $getType[$headerIndex] = 0;
            }
        }
    }  
$lineCount ++; 
} # END WHILE <INFILE>
close(INFILE);

print Dumper \%dataHash;

#WRITE BINARY FILE and TOC FILE
my %convert = (TXT=>sub{pack 'A*', join "\n", @_}, D=>sub{pack 'd*', @_}, UI=>sub{pack 'L*', @_});

open my $binfile, '>:raw', $lineArg1.'.bin' or die "Cannot write $lineArg1.bin: $!";
open my $tocfile, '>', $lineArg1.'.toc' or die "Cannot write $lineArg1.toc: $!";

for my $param (@param_names){
    my $data = $dataHash{$param};
    my @toc_line = ($param, $data->{type}, tell $binfile );
    print {$binfile} $convert{D}->(@{$data->{time}});
    push @toc_line, tell $binfile;
    print {$binfile} $convert{$data->{type}}->(@{$data->{data}});
    push @toc_line, tell $binfile;
    print {$tocfile} join(',',@toc_line,''),"\n";
}

sub trim { #Trim leading and trailing white space
  my (@strings) = @_;
  foreach my $string (@strings) {
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    chomp ($string);
  } 
  return wantarray ? @strings : $strings[0];
} # END SUB

sub findStr { #Return TRUE if string is contained in array.
    my $searchStr = shift;
    my $i = 0;
    foreach ( @_ ) {
        if ($_ eq $searchStr){
            return (1,$i);
        }
    $i ++;
    }
    return (0,-1);
} # END SUB

The output is as follows:

$VAR1 = {
          'Param 1' => {
                         'time' => [
                                     '1.1',
                                     '3.2',
                                     '5.3'
                                   ],
                         'type' => 'UI',
                         'data' => [
                                     '5',
                                     '10',
                                     '15'
                                   ]
                       },
          'Param 2' => {
                         'time' => [
                                     '4.5',
                                     '6.121'
                                   ],
                         'type' => 'D',
                         'data' => [
                                     '2.1234',
                                     '3.1234'
                                   ]
                       },
          'Param 3' => {
                         'time' => [
                                     '2.23',
                                     '7.56'
                                   ],
                         'type' => 'TXT',
                         'data' => [
                                     'Some Text 1',
                                     'Some Text 2'
                                   ]
                       }
        };

Here is the output table of contents file:

Param 1,UI,0,24,36,
Param 2,D,36,52,68,
Param 3,TXT,68,84,107,
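
To show how the three offsets on each TOC line are meant to be used, here is a rough sketch that decodes one parameter back out of the binary data. It rebuilds the 'Param 1' bytes in memory instead of opening the .bin file, so it runs on its own; the %fmt table and variable names are assumptions for illustration, not part of the script above.

```perl
use strict;
use warnings;

# Rebuild the 'Param 1' slice of the binary in memory (3 doubles, then 3 unsigned 32-bit ints).
my $bin = pack('d*', 1.1, 3.2, 5.3) . pack('L*', 5, 10, 15);

# One TOC line as shown above: name, type, time start, data start, data end.
my ($name, $type, $t0, $d0, $d1) = split /,/, 'Param 1,UI,0,24,36,';

# Map TOC types to unpack formats; TXT is newline-joined text and is split instead.
my %fmt = (D => 'd*', UI => 'L*');

my @time = unpack 'd*', substr($bin, $t0, $d0 - $t0);
my $raw  = substr($bin, $d0, $d1 - $d0);
my @data = $type eq 'TXT' ? split(/\n/, $raw) : unpack($fmt{$type}, $raw);

print "$name ($type): ", join(' ', map { "$time[$_]=$data[$_]" } 0 .. $#time), "\n";
# prints: Param 1 (UI): 1.1=5 3.2=10 5.3=15
```

Note that 'd' and 'L' use the machine's native byte order, so a reader on another platform (including Matlab) would need to assume the same endianness.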

Thanks everyone for their help so far! This is an excellent resource!

Added binary and TOC file writing code.
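
One possible follow-up on speed: findStr scans @param_names for every data line, which with about 2000 parameters could add up over millions of lines. The sketch below swaps that scan for a hash lookup built once up front. The name find_param is hypothetical, the parameter list is hard-coded so the snippet runs alone, and I have not measured the effect on a real file.

```perl
use strict;
use warnings;

my @param_names = ('Param 1', 'Param 2', 'Param 3');

# Build the lookup once: parameter name => its index in @param_names.
my %param_index;
@param_index{@param_names} = (0 .. $#param_names);

# Hypothetical stand-in for findStr: same (found, index) return shape,
# but a single hash lookup instead of a loop over the whole list.
sub find_param {
    my ($name) = @_;
    return exists($param_index{$name}) ? (1, $param_index{$name}) : (0, -1);
}

my ($found, $idx) = find_param('Param 2');
print "$found $idx\n";    # prints: 1 1
```

In the main loop, the call would take just the name field, for example find_param($lineData[1]), in place of passing the whole array each time.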
