Parsing unsorted data from large fixed width text
Question
I am mostly a Matlab user and a Perl n00b. This is my first Perl script.
I have a large fixed width data file that I would like to process into a binary file with a table of contents. My issue is that the data files are pretty large and the data parameters are sorted by time, which makes them difficult (at least for me) to parse in Matlab. Seeing how Matlab is not that good at parsing text, I thought I would try Perl. I wrote the following code, which works... at least on my small test file. However, it is painfully slow when I try it on an actual large data file. It was pieced together from lots of examples for various tasks from the web / Perl documentation.
Here is a small sample of the data file. Note: the real file has about 2000 parameters and is 1-2 GB. Parameters can be text, doubles, or unsigned integers.
Param 1 filter = ALL_VALUES
Param 2 filter = ALL_VALUES
Param 3 filter = ALL_VALUES
Time       Name                   Ty  Value
---------- ---------------------- --- ------------
1.1        Param 1                UI  5
2.23       Param 3                TXT Some Text 1
3.2        Param 1                UI  10
4.5        Param 2                D   2.1234
5.3        Param 1                UI  15
6.121      Param 2                D   3.1234
7.56       Param 3                TXT Some Text 2
The basic logic of my script is to:
- Read until the ---- line to build the list of parameters to extract (each always has "filter =").
- Use the ---- line to determine the field widths. It is broken by spaces.
- For each parameter, build time and data arrays (a while loop nested inside the foreach over parameters).
- In the continue block, write time and data to the binary file. Then record name, type, and offsets in a text table-of-contents file (used to read the file into Matlab later).
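To illustrate the template step in isolation, here is a small standalone sketch that builds the unpack format from the dashed separator line of the sample above (the separator string is copied from the sample; everything else is just for demonstration):

```perl
use strict;
use warnings;

# The dashed separator line from the sample file above.
my $sep = '---------- ---------------------- --- ------------';

# Each run of dashes plus its trailing spaces becomes one fixed-width field.
my @template = map { 'A' . length } $sep =~ /(\S+\s*)/g;
$template[-1] = 'A*';    # the last field runs to the end of the line
my $template = "@template";
print "$template\n";     # A11 A23 A4 A*

# Unpacking a sample data line with it ('A' strips trailing spaces):
my @fields = unpack $template, '1.1        Param 1                UI  5';
print join('|', @fields), "\n";    # 1.1|Param 1|UI|5
```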
Here is my script:
#!/usr/bin/perl
$lineArg1 = $ARGV[0];
open(INFILE, $lineArg1);
open BINOUT, '>:raw', $lineArg1.".bin";
open TOCOUT, '>', $lineArg1.".toc";
my $line;
my $data_start_pos;
my @param_name;
my @template;
while ($line = <INFILE>) {
    chomp $line;
    if ($line =~ s/\s+filter = ALL_VALUES//) {
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;
        push @param_name, $line;
    }
    elsif ($line =~ /^------/) {
        @template = map {'A'.length} $line =~ /(\S+\s*)/g;
        $template[-1] = 'A*';
        $data_start_pos = tell INFILE;
        last; #Reached start of data, exit loop
    }
}
my $template = "@template";
my @lineData;
my @param_data;
my @param_time;
my $data_type;
foreach $current_param (@param_name) {
    @param_time = ();
    @param_data = ();
    seek(INFILE, $data_start_pos, 0); #Jump to data start
    while ($line = <INFILE>) {
        if ($line =~ /$current_param/) {
            chomp($line);
            @lineData = unpack $template, $line;
            push @param_time, $lineData[0];
            push @param_data, $lineData[3];
        }
    } # END WHILE <INFILE>
} #END FOREACH NAME
continue {
    $data_type = $lineData[2];
    print TOCOUT $current_param.",".$data_type.",".tell(BINOUT).","; #Write name,type,offset to start of time
    print BINOUT pack('d*', @param_time); #Write timestamps
    print TOCOUT tell(BINOUT).","; #Offset to end of time / start of data
    if ($data_type eq "TXT") {
        print BINOUT pack 'A*', join("\n", @param_data);
    }
    elsif ($data_type eq "D") {
        print BINOUT pack('d*', @param_data);
    }
    elsif ($data_type eq "UI") {
        print BINOUT pack('L*', @param_data);
    }
    print TOCOUT tell(BINOUT).","."\n"; #Write offset to end of data
}
close(INFILE);
close(BINOUT);
close(TOCOUT);
So my questions to you good people of the web are as follows:
- What am I obviously messing up? Syntax, declaring variables when I don't need to, etc.
- It is probably slow (a guess) because of the nested loops and searching line by line over and over again. Is there a better way to restructure the loops to extract multiple lines at once?
- Any other speed improvement tips you can give?
I modified the example text file to illustrate non-integer time stamps and that Param Names may contain spaces.
Answer
I modified my code to build a hash as suggested. I have not incorporated the binary output yet due to time limitations. Plus I need to figure out how to reference the hash to get the data out and pack it into binary. I don't think that part should be too difficult... hopefully.
On an actual data file (~350 MB & 2.0 million lines) the following code takes approximately 3 minutes to build the hash. CPU usage was 100% on 1 of my cores (nil on the other 3) and Perl memory usage topped out at around 325 MB... until it dumped millions of lines to the prompt. However, the print Dumper will be replaced with a binary pack.
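One likely hot spot in the code below is the findStr linear scan performed for every data line. A hash-based membership test makes that lookup constant-time; this is only a sketch of the idea, reusing the @param_names list from the script (the sample names are taken from the example file):

```perl
use strict;
use warnings;

# Sketch: replace a per-line linear search through @param_names
# with a hash lookup built once up front.
my @param_names = ('Param 1', 'Param 2', 'Param 3');

# Map each name to its index; exists() is then a constant-time test.
my %param_index;
@param_index{@param_names} = 0 .. $#param_names;

my $name = 'Param 2';
if (exists $param_index{$name}) {
    print "$name is parameter #$param_index{$name}\n";   # Param 2 is parameter #1
}
```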
Please let me know if I am making any rookie mistakes.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $lineArg1 = $ARGV[0];
open(INFILE, $lineArg1);
my $line;
my @param_names;
my @template;
while ($line = <INFILE>) {
    chomp $line; #Remove newline
    if ($line =~ s/\s+filter = ALL_VALUES//) { #Find parameters and build a list
        push @param_names, trim($line);
    }
    elsif ($line =~ /^----/) {
        @template = map {'A'.length} $line =~ /(\S+\s*)/g; #Make template for unpack
        $template[-1] = 'A*';
        my $data_start_pos = tell INFILE;
        last; #Reached start of data, exit loop
    }
}
my $size = $#param_names + 1;
my @getType = ((1) x $size);
my $template = "@template";
my @lineData;
my %dataHash;
my $lineCount = 0;
while ($line = <INFILE>) {
    if ($lineCount % 100000 == 0) {
        print "On Line: ".$lineCount."\n";
    }
    if ($line =~ /^\d/) {
        chomp($line);
        @lineData = unpack $template, $line;
        my ($inHeader, $headerIndex) = findStr($lineData[1], @param_names);
        if ($inHeader) {
            push @{ $dataHash{$lineData[1]}{time} }, $lineData[0];
            push @{ $dataHash{$lineData[1]}{data} }, $lineData[3];
            if ($getType[$headerIndex]) { # Things that only need to be written once
                $dataHash{$lineData[1]}{type} = $lineData[2];
                $getType[$headerIndex] = 0;
            }
        }
    }
    $lineCount++;
} # END WHILE <INFILE>
close(INFILE);
print Dumper \%dataHash;

#WRITE BINARY FILE and TOC FILE
my %convert = (
    TXT => sub { pack 'A*', join "\n", @_ },
    D   => sub { pack 'd*', @_ },
    UI  => sub { pack 'L*', @_ },
);
open my $binfile, '>:raw', $lineArg1.'.bin';
open my $tocfile, '>', $lineArg1.'.toc';
for my $param (@param_names) {
    my $data = $dataHash{$param};
    my @toc_line = ($param, $data->{type}, tell $binfile);
    print {$binfile} $convert{D}->(@{ $data->{time} });
    push @toc_line, tell $binfile;
    print {$binfile} $convert{$data->{type}}->(@{ $data->{data} });
    push @toc_line, tell $binfile;
    print {$tocfile} join(',', @toc_line, ''), "\n";
}

sub trim { #Trim leading and trailing whitespace
    my (@strings) = @_;
    foreach my $string (@strings) {
        $string =~ s/^\s+//;
        $string =~ s/\s+$//;
        chomp($string);
    }
    return wantarray ? @strings : $strings[0];
} # END SUB

sub findStr { #Return TRUE and index if string is contained in array.
    my $searchStr = shift;
    my $i = 0;
    foreach (@_) {
        if ($_ eq $searchStr) {
            return (1, $i);
        }
        $i++;
    }
    return (0, -1);
} # END SUB
The output looks like this:
$VAR1 = {
          'Param 1' => {
                         'time' => [
                                     '1.1',
                                     '3.2',
                                     '5.3'
                                   ],
                         'type' => 'UI',
                         'data' => [
                                     '5',
                                     '10',
                                     '15'
                                   ]
                       },
          'Param 2' => {
                         'time' => [
                                     '4.5',
                                     '6.121'
                                   ],
                         'type' => 'D',
                         'data' => [
                                     '2.1234',
                                     '3.1234'
                                   ]
                       },
          'Param 3' => {
                         'time' => [
                                     '2.23',
                                     '7.56'
                                   ],
                         'type' => 'TXT',
                         'data' => [
                                     'Some Text 1',
                                     'Some Text 2'
                                   ]
                       }
        };
And here is the output TOC file:
Param 1,UI,0,24,36,
Param 2,D,36,52,68,
Param 3,TXT,68,84,107,
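Those offsets check out: each TOC row holds the name, type, start of the time block, time/data boundary, and end of the data block, so any record can be pulled back out with seek/read/unpack. A small round-trip sketch for the 'Param 1,UI,0,24,36,' row (it uses an in-memory filehandle to stand in for the .bin file; the values are the Param 1 samples from above):

```perl
use strict;
use warnings;

# Offsets from the 'Param 1,UI,0,24,36,' TOC row:
# 3 timestamps as doubles (3 x 8 = 24 bytes), then 3 unsigned ints (3 x 4 = 12).
my $bin = '';
open my $binfh, '+>:raw', \$bin or die $!;   # in-memory stand-in for the .bin file
print {$binfh} pack('d*', 1.1, 3.2, 5.3);    # time block: bytes 0..24
print {$binfh} pack('L*', 5, 10, 15);        # data block: bytes 24..36

# Read it back using the TOC offsets (start, time/data boundary, end).
my ($start, $mid, $end) = (0, 24, 36);
seek $binfh, $start, 0;
read $binfh, my $timebuf, $mid - $start;
read $binfh, my $databuf, $end - $mid;
my @time = unpack 'd*', $timebuf;
my @data = unpack 'L*', $databuf;
print "@data\n";    # 5 10 15
```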
Thanks everyone for your help so far! This is an excellent resource!
Added binary & TOC file writing code.