bash, sed, awk: remove block of text with a duplicate ID and an older date within the block
Question
I would like to remove every block with a non-unique ID, except for the one that has the newest date.
I hope the examples speak for themselves. Any awk and/or sed solution would be appreciated!
Original file:
<BLOCK>
ID=1000
Text
Text
DATE=20160101
Text
</BLOCK>
<BLOCK>
Text
Text
ID=2000
DATE=20140101
Text
Text
</BLOCK>
<BLOCK>
ID=1000
DATE=20100101
Text
</BLOCK>
<BLOCK>
Text
ID=3000
Text
Text
DATE=20160101
Text
</BLOCK>
<BLOCK>
Text
Text
ID=2000
Text
DATE=20151231
</BLOCK>
The result should look like this:
<BLOCK>
ID=1000
Text
Text
DATE=20160101
Text
</BLOCK>
<BLOCK>
Text
ID=3000
Text
Text
DATE=20160101
Text
</BLOCK>
<BLOCK>
Text
Text
ID=2000
Text
DATE=20151231
</BLOCK>
Thank you for your help!
Recommended answer
This will work with any awk on any system. It relies on awk's paragraph mode (RS=""), so it assumes the blocks in the file are separated by blank lines:
$ cat tst.awk
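BEGIN { RS=""; ORS="\n\n" }    # paragraph mode: one blank-line-separated block per record
{
    # Reduce a copy of the record to just the value after ID= (and likewise DATE=)
    id = date = $0
    gsub(/.*\nID=|\n.*/,"",id)
    gsub(/.*\nDATE=|\n.*/,"",date)
}
date > dates[id] {             # string comparison; YYYYMMDD values sort chronologically
    dates[id] = date
    recs[id] = $0
}
END {
    for (id in recs) {
        print recs[id]
    }
}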
$ awk -f tst.awk file
<BLOCK>
ID=1000
Text
Text
DATE=20160101
Text
</BLOCK>
<BLOCK>
Text
Text
ID=2000
Text
DATE=20151231
</BLOCK>
<BLOCK>
Text
ID=3000
Text
Text
DATE=20160101
Text
</BLOCK>
You don't explain what the output order should be, and it's not obvious from your example, so I assume you don't care. The script above therefore outputs the records in "random" (actually hash) order, since the iteration order of for (id in recs) is unspecified.
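If the order of the surviving blocks does matter, one possible variation is to remember each ID the first time it appears and print in that sequence. The following is only a sketch and not part of the original answer: the function name newest_per_id is hypothetical, and it keeps the same assumptions as the script above (blocks separated by blank lines, dates in YYYYMMDD form so that string comparison is chronological).

```shell
# newest_per_id FILE  (hypothetical helper name, chosen for illustration)
# For each ID keep only the block with the newest DATE, and print the
# surviving blocks in the order in which their IDs first appear in FILE.
newest_per_id() {
    awk '
    BEGIN { RS = ""; ORS = "\n\n" }        # paragraph mode, as in tst.awk
    {
        id = date = $0
        gsub(/.*\nID=|\n.*/, "", id)       # leave only the ID value
        gsub(/.*\nDATE=|\n.*/, "", date)   # leave only the DATE value
    }
    # This rule must come before the next one: merely referencing
    # dates[id] there creates the element, which would defeat the test.
    !(id in dates) { order[++n] = id }
    date > dates[id] {
        dates[id] = date
        recs[id] = $0
    }
    END {
        for (i = 1; i <= n; i++)
            if (order[i] in recs)
                print recs[order[i]]
    }
    ' "$1"
}
```

Calling newest_per_id file then prints one block per ID, ordered by where each ID first occurs in the input rather than by where its newest block occurs.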