Perform a different regular expression for each column in a tab-delimited file
Problem description
I found myself writing Perl for the first time in about 8 years, and I am having difficulty with something that should be easy. Here is the basic premise:
A file containing a hundred or so fields, 10 of which have incorrect data (the O's are 0's):
A B C D E F ...
br0wn red 1278076 0range "20 tr0ut" 123 ...
Green 0range 90876 Yell0w "18 Salm0n" 456 ...
I am trying to write a program that splits the fields and then lets me run a regex on field A to replace 0 with O, but not replace 0 with O in column C, and so on. I have the additional problem of possibly needing to run an alternate regex for column E, for instance.
I was able to split all the fields in a record on the tab character (\t). I am having trouble writing the code that goes over each field and runs a specific regex depending on which field it is.
Any help would be appreciated, and I will PayPal you 10 dollars for a beverage of your choice if you solve it.
Recommended answer
Using a CSV parser such as Text::CSV is not complicated. Something like this might suffice:
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({
    sep_char => "\t",
    binary   => 1,
    eol      => $/,
});
while (my $row = $csv->getline(*DATA)) {
    tr/0/o/ for @{$row}[0, 1, 3];           # replace in cols A, B and D
    s/(?<!\d)0(?!\d)/o/g for @{$row}[4];    # replace in col E
    $csv->print(*STDOUT, $row);             # print the result
}
__DATA__
A B C D E F
br0wn red 1278076 0range "20 tr0ut" 123
Green 0range 90876 Yell0w "18 Salm0n" 456
Output:
A B C D E F
brown red 1278076 orange "20 trout" 123
Green orange 90876 Yellow "18 Salmon" 456
Note that I handled your mixed string (column E) with a simplistic regex instead of transliteration (which replaces every occurrence). The regex leaves alone any zero that is next to a digit, so it will still go wrong for certain numbers, such as 20.0 or 0, where a zero has no digit directly beside it and gets replaced anyway.
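To make that caveat concrete, here is a small sketch that applies the same column E substitution to a few sample strings. The helper name fix_mixed is only an illustrative assumption and is not part of the answer's script.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the column E substitution to a copy of the string.
sub fix_mixed {
    my ($text) = @_;
    (my $fixed = $text) =~ s/(?<!\d)0(?!\d)/o/g;   # replace only zeroes with no digit on either side
    return $fixed;
}

# The first two samples are repaired as intended; the last two are numbers
# that the regex wrongly rewrites ("20.0" becomes "20.o", "0" becomes "o").
for my $sample ('20 tr0ut', '18 Salm0n', '20.0', '0') {
    print "$sample => ", fix_mixed($sample), "\n";
}
```

If column E can contain decimals or a bare zero, the pattern would need to be tightened, for example by only replacing a zero that has a letter beside it.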
Update:
If you want to do the substitutions based on column names instead of position, things get a bit more complicated. However, Text::CSV can handle it.
use strict;
use warnings;
use Text::CSV;
my @pure_text = qw(A B D);
my @mixed = qw(E);
my $csv = Text::CSV->new({
    sep_char => "\t",
    binary   => 1,
    eol      => $/,
});
my $cols = $csv->getline(*DATA);                # read column names
$csv->print(*STDOUT, $cols);
$csv->column_names($cols);                      # set column names
while (my $row = $csv->getline_hr(*DATA)) {     # hash ref instead of array ref
    tr/0/o/ for @{$row}{@pure_text};            # substitution on hash slice
    s/(?<!\d)0(?!\d)/o/g for @{$row}{@mixed};
    my @row = @{$row}{@$cols};                  # make temp array for printing
    $csv->print(*STDOUT, \@row);
}
__DATA__
A B C D E F
br0wn red 1278076 0range "20 tr0ut" 123
Green 0range 90876 Yell0w "18 Salm0n" 456
This code is a standalone demonstration. To try the code on a file, change *DATA to *STDIN and use the script as follows:
perl script.pl < input.csv
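For comparison, the question's original idea of a plain split with one rule per field can be sketched with a dispatch table keyed by column index. This is only a minimal sketch: it assumes no field contains an embedded tab or needs quote handling (Text::CSV is the safer choice when that might happen), and the names %fix_for and fix_line are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical per-column rules: column index => code ref that edits its argument in place.
my %fix_for = (
    0 => sub { $_[0] =~ tr/0/o/ },                 # column A
    1 => sub { $_[0] =~ tr/0/o/ },                 # column B
    3 => sub { $_[0] =~ tr/0/o/ },                 # column D
    4 => sub { $_[0] =~ s/(?<!\d)0(?!\d)/o/g },    # column E (mixed text and numbers)
);

# Split one tab-separated line, apply the matching rule to each field, and rejoin.
sub fix_line {
    my ($line) = @_;
    my @fields = split /\t/, $line, -1;    # -1 keeps trailing empty fields
    for my $i (0 .. $#fields) {
        my $fix = $fix_for{$i} or next;    # columns without a rule pass through unchanged
        $fix->($fields[$i]);               # @_ aliases the array element, so the edit sticks
    }
    return join "\t", @fields;
}

while (my $line = <STDIN>) {
    chomp $line;
    print fix_line($line), "\n";
}
```

It is run the same way as the script above, reading the file on standard input. Adding a rule for another column means adding one entry to the table, and quotes around a field are passed through untouched because the line is never parsed as CSV.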