bash, sed, awk: remove a block of text with a duplicate ID and an older date within the block

Problem description

I would like to remove every block with a non-unique ID, except for the one that has the newest date.

I hope the examples speak for themselves. Any awk and/or sed solution would be appreciated!

Original file:

<BLOCK>
ID=1000
Text
Text
DATE=20160101
Text
</BLOCK>

<BLOCK>
Text
Text
ID=2000
DATE=20140101
Text
Text
</BLOCK>

<BLOCK>
ID=1000
DATE=20100101
Text
</BLOCK>

<BLOCK>
Text
ID=3000
Text
Text
DATE=20160101
Text
</BLOCK>

<BLOCK>
Text
Text
ID=2000
Text
DATE=20151231
</BLOCK>

The result should look like this:

<BLOCK>
ID=1000
Text
Text
DATE=20160101
Text
</BLOCK>

<BLOCK>
Text
ID=3000
Text
Text
DATE=20160101
Text
</BLOCK>

<BLOCK>
Text
Text
ID=2000
Text
DATE=20151231
</BLOCK>

Thanks for your help!

Recommended answer

This will work with any awk on any system:

$ cat tst.awk
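# Paragraph mode: each blank-line-separated block is one record.
BEGIN { RS=""; ORS="\n\n" }
{
    # Strip everything up to "ID=" / "DATE=" and everything after the value.
    id = date = $0
    gsub(/.*\nID=|\n.*/,"",id)
    gsub(/.*\nDATE=|\n.*/,"",date)
}
# Keep this block if its date is newer than the newest one seen for this ID.
# YYYYMMDD dates of equal length compare correctly as strings.
date > dates[id] {
    dates[id] = date
    recs[id] = $0
}
END {
    # The iteration order of "for (id in recs)" is unspecified.
    for (id in recs) {
        print recs[id]
    }
}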

$ awk -f tst.awk file
<BLOCK>
ID=1000
Text
Text
DATE=20160101
Text
</BLOCK>

<BLOCK>
Text
Text
ID=2000
Text
DATE=20151231
</BLOCK>

<BLOCK>
Text
ID=3000
Text
Text
DATE=20160101
Text
</BLOCK>

You don't explain what the output order should be, and it isn't obvious from your example, so I assume you don't care. The script above therefore outputs the records in "random" (actually hash) order.
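If the order does matter, note that the expected result in the question lists the surviving blocks in the order they appear in the input (1000, 3000, 2000). The following is a minimal sketch of one way to reproduce that order, not part of the original answer: it reads the file twice, recording the newest date per ID in the first pass and printing the matching block in the second. The file names `sample.txt` and `keep_order.awk` and the array names `newest` and `printed` are hypothetical, and the sample data is a shortened version of the question's input. It assumes, like the script above, that dates are fixed-width YYYYMMDD strings.

```shell
#!/bin/sh
# Hypothetical scratch directory and file names, used only for this sketch.
workdir=$(mktemp -d)

# Shortened version of the question's input: same IDs and dates, filler lines dropped.
cat > "$workdir/sample.txt" <<'EOF'
<BLOCK>
ID=1000
DATE=20160101
</BLOCK>

<BLOCK>
ID=2000
DATE=20140101
</BLOCK>

<BLOCK>
ID=1000
DATE=20100101
</BLOCK>

<BLOCK>
ID=3000
DATE=20160101
</BLOCK>

<BLOCK>
ID=2000
DATE=20151231
</BLOCK>
EOF

cat > "$workdir/keep_order.awk" <<'EOF'
BEGIN { RS = ""; ORS = "\n\n" }
{
    # Same extraction as in tst.awk above.
    id = date = $0
    gsub(/.*\nID=|\n.*/, "", id)
    gsub(/.*\nDATE=|\n.*/, "", date)
}
# Pass 1 (first copy of the file): remember the newest date per ID.
NR == FNR { if (date > newest[id]) newest[id] = date; next }
# Pass 2 (second copy): print the first block carrying that newest date.
date == newest[id] && !printed[id]++
EOF

# The same file is passed twice so awk can make two passes over it.
result=$(awk -f "$workdir/keep_order.awk" "$workdir/sample.txt" "$workdir/sample.txt")
printf '%s\n' "$result"

rm -rf "$workdir"
```

Reading the input twice avoids holding every block in memory and needs no sorting, but it only works on a regular file, not on a pipe. If two blocks share both the ID and the newest date, this sketch keeps the first of them.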
