Dynamic regular expressions in awk
Problem description
I have text files like these:
1.txt
AA;00000;
BB;11111;
GG;22222;
2.txt
KK;WW;55555;11111;
KK;FF;ZZ;11111;
KK;RR;YY;11111;
From them I generate this 3.txt output:
AA;00000;
BB;11111;KK;WW;55555;FF;ZZ;RR;YY
GG;22222;
with this .awk script (I use it on Windows with cmd):
#!/usr/bin/awk -f
NR != FNR {
    exit
}
{
    printf "%s", $0
}
/^BB/ {
    o = ""
    while (getline tmp < ARGV[2]) {
        n = split(tmp, arr, ";")
        for (i = 1; i <= n; i++)
            if (!match($0, arr[i]) && !match(o, arr[i]))
                o = o arr[i] ";"
    }
    printf "%s", o
}
{
    print ""
}
Usage is awk -f script.awk 1.txt 2.txt
This seems to work, but consider this situation:
1.txt
AA;BB;
2.txt
CC;DD;BB;AA;
Now replace the values this way:
AA is replaced with d(2)
BB is replaced with http://a.o/f/i.p?t=1
CC is replaced with Link
DD is replaced with A_x-y.7z
The script cannot generate this 3.txt:
AA;BB;CC;DD;
or, using the replaced text, it cannot generate this 3.txt output:
d(2);http://a.o/f/i.p?t=1;Link;A_x-y.7z;
You can see that duplicate fields like AA and BB are removed from the 3.txt output, because the script works that way.
I suspect it has to do with the (...) being taken as a regex grouping in match(): its second parameter is a regex, so each arr[i] passed to it is treated as a "dynamic regular expression" in awk-speak.
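One quick way to check that suspicion is to compare match() with the literal index() on the replacement strings above. This is only a sketch: the file name regex_vs_literal.awk and the test string s are made up for illustration.

```shell
cat > regex_vs_literal.awk <<'EOF'
BEGIN {
    s = "d(2);http://a.o/f/i.p?t=1;"
    # As a dynamic regex, "d(2)" means "d" followed by the group "2",
    # so it looks for the text "d2", not for the literal "d(2)".
    print match(s, "d(2)")                  # 0: no "d2" anywhere in s
    print index(s, "d(2)")                  # 1: literal substring at position 1
    # In the URL, "." matches any character and "p?" makes the "p" optional,
    # so the regex no longer matches the literal "?" in the text.
    print match(s, "http://a.o/f/i.p?t=1")  # 0
    print index(s, "http://a.o/f/i.p?t=1")  # 6
}
EOF
awk -f regex_vs_literal.awk
```

Because match() returns 0 for these fields, the !match(...) tests in the script are true even when the field is already present, so the duplicate check no longer does what was intended.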
Recommended answer
$ cat tst.awk
BEGIN { FS=OFS=";" }
{ key = $(NF-1) }
NR == FNR {
    for (i=1; i<(NF-1); i++) {
        if ( !seen[key,$i]++ ) {
            map[key] = (key in map ? map[key] OFS : "") $i
        }
    }
    next
}
{ print $0 map[key] }
$ awk -f tst.awk 2.txt 1.txt
AA;00000;
BB;11111;KK;WW;55555;FF;ZZ;RR;YY
GG;22222;
The above uses only literal strings in a hash lookup of array indices, so it doesn't care what characters the input contains. If you want your input to be treated as literal strings, then don't use regexp functions or operators (e.g. match(), ~, sub()) on it; just use string functions/operators (e.g. index(), ==, substr(), in).
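As a rough illustration of that advice applied to the second example in the question, the sketch below appends every field that a line does not already contain, comparing with index() and array lookups only. It is not the script above: the file names literal.awk, lit1.txt and lit2.txt are hypothetical, it extends every line rather than only those starting with BB, and it assumes every line ends with ";" as in the question's samples.

```shell
cat > literal.awk <<'EOF'
BEGIN { FS = ";" }
# First file: remember each non-empty field once, in order of appearance.
NR == FNR {
    for (i = 1; i <= NF; i++)
        if ($i != "" && !seen[$i]++)
            extra[++n] = $i
    next
}
# Second file: append the remembered fields that the line lacks.
{
    out = $0
    for (j = 1; j <= n; j++)
        # index() searches for a literal substring, so "(", "?" and "."
        # are ordinary characters; the ";" wrappers force whole-field hits.
        if (index(";" $0, ";" extra[j] ";") == 0)
            out = out extra[j] ";"
    print out
}
EOF
printf 'd(2);http://a.o/f/i.p?t=1;\n' > lit1.txt
printf 'Link;A_x-y.7z;http://a.o/f/i.p?t=1;d(2);\n' > lit2.txt
awk -f literal.awk lit2.txt lit1.txt
# prints: d(2);http://a.o/f/i.p?t=1;Link;A_x-y.7z;
```

Wrapping both sides in ";" is a design choice to avoid partial hits, for example a field A being found inside AA.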