How can I read multiple lines of a file into blocks in Perl?
Question
I have a file which contains the text below.
#L_ENTRY <s_slash_1>
#LEX </>
#ROOT </>
#POS <sp>
#SUBCAT <slash>
#S_LINK <>
#BITS <>
#WEIGHT <0.1>
#SYNONYM <0>
#L_ENTRY <s_comma_1>
#LEX <,>
#ROOT <,>
#POS <sp>
#SUBCAT <comma>
#S_LINK <>
#BITS <>
#WEIGHT <0.1>
#SYNONYM <0>
#L_ENTRY <s_tilde_1>
#LEX <~>
#ROOT <~>
#POS <sp>
#SUBCAT <tilde>
#S_LINK <>
#BITS <>
#WEIGHT <0.1>
#SYNONYM <0>
#L_ENTRY <s_at_1>
#LEX <@>
#ROOT <@>
#POS <sp>
#SUBCAT <at>
#S_LINK <>
#BITS <>
#WEIGHT <0.1>
#SYNONYM <0>
I know how to read the lines into an array using Perl, but in this case I want an array whose elements are whole blocks, each beginning with #L_ENTRY and ending with #SYNONYM <0>.
Can anyone help?
Recommended answer
There are two ways to do it. First, you can set the "input record separator" special variable, $/ (documented in perlvar). In short, you are telling Perl that a "line" is terminated not by a newline character but by a string of your choosing. In your case, you could set it to "#SYNONYM <0>\n". Each read then returns everything up to and including the next occurrence of that string; if the string does not occur again, you get whatever is left in the file. So, for input data that looks like this:
#L_ENTRY <s_slash_1>
#LEX </>
#ROOT </>
#POS <sp>
#SUBCAT <slash>
#S_LINK <>
#BITS <>
#WEIGHT <0.1>
#SYNONYM <0>
#L_ENTRY <s_comma_1>
#LEX <,>
#ROOT <,>
#POS <sp>
#SUBCAT <comma>
#S_LINK <>
#BITS <>
#WEIGHT <0.1>
#SYNONYM <0>
If you run this:
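use v5.14;
use warnings;

my $filename = "data.txt";
open(my $fh, '<', $filename) or die "$filename: $!";

# Treat "#SYNONYM <0>" plus its newline as the end of each record.
local $/ = "#SYNONYM <0>\n";
my @chunks = <$fh>;    # one array element per record
close $fh;

say $chunks[0];
say '---';
say $chunks[1];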
you get:
#L_ENTRY <s_slash_1>
#LEX </>
#ROOT </>
#POS <sp>
#SUBCAT <slash>
#S_LINK <>
#BITS <>
#WEIGHT <0.1>
#SYNONYM <0>
---
#L_ENTRY <s_comma_1>
#LEX <,>
#ROOT <,>
#POS <sp>
#SUBCAT <comma>
#S_LINK <>
#BITS <>
#WEIGHT <0.1>
#SYNONYM <0>
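A couple of notes about this:
- Any extra data between records gets "caught in the net" and ends up at the start of the record that follows it;
- The record separator itself is still part of the data, at the end of each record.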
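Both notes can be seen in a small experiment. The following is a minimal sketch, assuming the records are held in an in-memory string instead of data.txt; the $data variable, the shortened records, and the "stray line" are made up for illustration. It also shows one way to deal with the second note: chomp removes the current value of $/ from the end of a string, not just a newline.

```perl
use v5.14;
use warnings;

# Hypothetical sample data; stands in for the contents of data.txt.
my $data = <<'END';
#L_ENTRY <s_slash_1>
#LEX </>
#SYNONYM <0>
stray line between records
#L_ENTRY <s_comma_1>
#LEX <,>
#SYNONYM <0>
END

# Open the string as a read-only in-memory filehandle.
open(my $fh, '<', \$data) or die "in-memory open failed: $!";

my @chunks;
{
    # Limit the changed record separator to this block.
    local $/ = "#SYNONYM <0>\n";
    @chunks = <$fh>;
    # chomp strips the current value of $/, i.e. the separator line.
    chomp @chunks;
}
close $fh;

say scalar @chunks;    # number of records read
print $chunks[1];      # this record begins with the stray line
```

Wrapping the assignment in a bare block keeps the modified $/ from affecting any later reads in the same program.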
To get more control, it is better to process the data line by line and use regexes to switch between "capture" mode and "don't capture" mode:
use v5.14;
use warnings;

my $filename = "data.txt";
open(my $fh, '<', $filename) or die "$filename: $!";

# /x lets the patterns contain whitespace; a literal # must be escaped.
my $found_start_token = qr/ ^ \s* \#L_ENTRY \s* /x;
my $found_stop_token  = qr/ ^ \s* \#SYNONYM \s+ <0> \s* $ /x;

my @chunks;
my $chunk = '';
my $capture_mode = 0;

while ( <$fh> ) {
    # Start capturing at an #L_ENTRY line.
    $capture_mode = 1 if /$found_start_token/;
    next unless $capture_mode;    # skip data between records

    $chunk .= $_;

    # A #SYNONYM <0> line completes the current chunk.
    if (/$found_stop_token/) {
        push @chunks, $chunk;
        $chunk = '';
        $capture_mode = 0;
    }
}
close $fh;

say $chunks[0];
say '---';
say $chunks[1];
exit 0;
Some notes:
- The program works by concatenating the current line, $_, onto $chunk while we are in capture mode.
- Capture mode is turned on and off using regexes in "extended mode", /x. This allows adding whitespace to the regex for easier reading.
- Extra data between records will not appear in the chunks.
- It produces the same output as before.
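To illustrate the note about extra data between records, here is a minimal sketch of the same capture-mode loop wrapped in a subroutine and run on input that contains stray lines. The read_chunks name, the shortened records, and the in-memory sample data are assumptions made for this example and are not part of the original answer.

```perl
use v5.14;
use warnings;

# Hypothetical helper: collect blocks that run from an #L_ENTRY line
# to the next #SYNONYM <0> line, ignoring anything between blocks.
sub read_chunks {
    my ($fh) = @_;
    my $start = qr/ ^ \s* \#L_ENTRY \b /x;
    my $stop  = qr/ ^ \s* \#SYNONYM \s+ <0> \s* $ /x;

    my (@chunks, $chunk);
    my $capturing = 0;
    while ( my $line = <$fh> ) {
        if ( !$capturing && $line =~ $start ) {
            $capturing = 1;
            $chunk     = '';
        }
        next unless $capturing;    # drop lines outside a block
        $chunk .= $line;
        if ( $line =~ $stop ) {
            push @chunks, $chunk;
            $capturing = 0;
        }
    }
    return @chunks;
}

# Made-up input with noise before and between the records.
my $data = <<'END';
noise before the first record
#L_ENTRY <s_slash_1>
#LEX </>
#SYNONYM <0>
stray line between records
#L_ENTRY <s_comma_1>
#LEX <,>
#SYNONYM <0>
END

open(my $fh, '<', \$data) or die "in-memory open failed: $!";
my @chunks = read_chunks($fh);
close $fh;

print $chunks[0];
say '---';
print $chunks[1];
```

Because each chunk already ends with a newline, this sketch uses print instead of say for the chunks. A final block that never reaches its #SYNONYM <0> line is silently discarded here; whether that is the right behavior depends on the data.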