Dynamic regular expressions in awk


Question

I have text files like these:

1.txt

AA;00000;
BB;11111;
GG;22222;

2.txt

KK;WW;55555;11111;
KK;FF;ZZ;11111;
KK;RR;YY;11111;

From them I generate this 3.txt output

AA;00000;
BB;11111;KK;WW;55555;FF;ZZ;RR;YY
GG;22222;

with this .awk script (I run it on Windows from cmd):

#!/usr/bin/awk -f 

NR != FNR {
    exit
}
{
    printf "%s", $0
}
/^BB/ {
    o = ""
    while (getline tmp < ARGV[2]) {
        n = split (tmp,arr,";")
        for (i=1; i<=n; i++)
            if(!match($0,arr[i]) && !match(o,arr[i]))
                o=o arr[i]";"
    }
    printf "%s", o
}
{
    print ""
}

The usage is awk -f script.awk 1.txt 2.txt

This seems to work, but consider this situation:

1.txt

AA;BB;

2.txt

CC;DD;BB;AA;

Now replace the fields in this way:

AA is replaced with d(2)
BB is replaced with http://a.o/f/i.p?t=1
CC is replaced with Link
DD is replaced with A_x-y.7z

The script cannot generate this 3.txt:

AA;BB;CC;DD;

or, using the replaced text, it cannot generate this 3.txt output:

   d(2);http://a.o/f/i.p?t=1;Link;A_x-y.7z;

You can see that duplicate fields such as AA and BB are removed from the 3.txt output, because that is how the script is meant to work.

I suspect it has to do with the (...) being taken as a regex grouping in match(): its second parameter is a regex, so each arr[i] passed there is treated as a "dynamic regular expression" in awk speak.
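
This suspicion can be checked in isolation. The sketch below is only an illustration: the sample line is built from the replaced fields above, and the variable names are made up. It passes each replaced field to match() as the pattern, against a line that contains that field literally. A result of 0 means awk found no match, because the field was interpreted as a regular expression rather than as plain text:

```shell
line='d(2);http://a.o/f/i.p?t=1;'

# As a regex, "d(2)" means "d" followed by a grouped "2", so it only matches "d2"
paren_pos=$(awk -v s="$line" 'BEGIN { print match(s, "d(2)") }')

# As a regex, "p?" means "an optional p", so the literal "?" in the URL is never matched
url_pos=$(awk -v s="$line" 'BEGIN { print match(s, "http://a.o/f/i.p?t=1") }')

echo "d(2): $paren_pos  url: $url_pos"
```

Both positions come back as 0, so the script concludes that the field is not present yet and appends it again.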

Recommended answer

$ cat tst.awk
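BEGIN { FS=OFS=";" }                 # ";" separates fields on input and output
{ key = $(NF-1) }                    # lines end with ";", so $NF is empty and $(NF-1) is the key
NR == FNR {                          # first file (2.txt): collect the fields for each key
    for (i=1; i<(NF-1); i++) {
        if ( !seen[key,$i]++ ) {     # literal string lookup: skip fields already stored for this key
            map[key] = (key in map ? map[key] OFS : "") $i
        }
    }
    next
}
{ print $0 map[key] }                # second file (1.txt): append the collected fields, if any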

$ awk -f tst.awk 2.txt 1.txt
AA;00000;
BB;11111;KK;WW;55555;FF;ZZ;RR;YY
GG;22222;

The above uses only literal strings, in hash lookups of array indices, so it does not care which characters the input contains. If you want your input to be treated as literal strings, do not use regexp functions or operators on it (e.g. match(), ~, sub()); use string functions and operators instead (e.g. index(), ==, substr(), in).
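
As a minimal sketch of that advice applied to the kind of test the original loop performs, the helper below checks for a field with index() instead of match(). has_field() is a hypothetical name introduced here for illustration; it is not part of either script, and it assumes every list is ";"-terminated as in the sample files:

```shell
result=$(awk '
# Return 1 if field occurs as a whole ";"-terminated entry in list, else 0.
function has_field(list, field) {
    # Wrap both sides in ";" so that "AA" is not found inside "AAA"
    return (index(";" list, ";" field ";") > 0)
}
BEGIN {
    line = "d(2);http://a.o/f/i.p?t=1;"
    # Literal comparison: "(", "?" and "." have no special meaning here
    print has_field(line, "d(2)"), has_field(line, "http://a.o/f/i.p?t=1"), has_field(line, "Link")
}')
echo "$result"
```

This prints 1 1 0: the two fields that are present are found despite their special characters, and Link is correctly reported as missing. In a loop like the one in the question, the empty field produced by the trailing ";" would need to be skipped explicitly, since an empty regex matched everything but an empty literal field does not.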
