bash, awk, sed: remove XML blocks with duplicate IDs, keep the most up-to-date, keep original order

Problem description

I would appreciate any help with this scripting task.

I need to remove every block with a non-unique ID except the one that has the newest date. If the dates are equal, then the last entry in the file should win and be kept.

The original sorting order of the input has to be preserved.

Input:

<DATA>
<TABLES>

<BLOCK>
<ID V="333"/>
<TEXT/>
<TEXT/>
<DATE V="20160101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<TEXT/>
<TEXT/>
<ID V="4444"/>
<DATE V="20140101 00:00:00"/>
<TEXT/>
<TEXT/>
</BLOCK>

<BLOCK>
<ID V="333"/>
<DATE V="20100101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<TEXT/>
<ID V="4444"/>
<TEXT/>
<TEXT/>
<DATE V="20160101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<TEXT/>
<ID V="7777777"/>
<TEXT/>
<TEXT/>
<DATE V="20130101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<ID V="333"/>
<DATE V="20120101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<TEXT/>
<TEXT/>
<ID V="22"/>
<TEXT/>
<DATE V="20151231 00:00:00"/>
</BLOCK>

<BLOCK>
<TEXT/>
<ID V="7777777"/>
<TEXT/>
<TEXT/>
<DATE V="20130101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<TEXT/>
<ID V="22"/>
<TEXT/>
<TEXT/>
<DATE V="20130101 00:00:00"/>
<TEXT/>
</BLOCK>

</TABLES>
</DATA>

Expected output:

<DATA>
<TABLES>

<BLOCK>
<ID V="333"/>
<TEXT/>
<TEXT/>
<DATE V="20160101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<TEXT/>
<ID V="4444"/>
<TEXT/>
<TEXT/>
<DATE V="20160101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<TEXT/>
<TEXT/>
<ID V="22"/>
<TEXT/>
<DATE V="20151231 00:00:00"/>
</BLOCK>

<BLOCK>
<TEXT/>
<ID V="7777777"/>
<TEXT/>
<TEXT/>
<DATE V="20130101 00:00:00"/>
<TEXT/>
</BLOCK>

</TABLES>
</DATA>

Recommended answer

As I mentioned in my comment under your question, it's not entirely clear what output order you want, but here is one interpretation. The script loops through the records in the order they appear in the input file and prints a record only if it is the last one in the file that contains the max date for its ID. It will work in any awk on any UNIX system.

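$ cat tst.awk
# Paragraph mode: each blank-line-separated block is one record.
BEGIN { RS=""; ORS="\n\n" }
{
    # Reduce a copy of the record to just the attribute value.
    # Records with no ID/DATE line (the DATA/TABLES wrapper) are left
    # whole, so each acts as its own unique key and is always kept.
    id = date = $0
    gsub(/.*\n<ID V="|".*/,"",id)
    gsub(/.*\n<DATE V="|".*/,"",date)
}

# String comparison; ">=" lets a later record win when dates are equal.
date >= id2maxDate[id] {
    delete maxDateRecNr2rec[id2maxDateRecNr[id]]   # drop the previous winner
    id2maxDateRecNr[id]  = NR
    maxDateRecNr2rec[NR] = $0
    id2maxDate[id]       = date
}

# Print the surviving records in their original input order.
END {
    for (recNr=1; recNr<=NR; recNr++) {
        if ( recNr in maxDateRecNr2rec ) {
            print maxDateRecNr2rec[recNr]
        }
    }
}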

$ awk -f tst.awk file
<DATA>
<TABLES>

<BLOCK>
<ID V="333"/>
<TEXT/>
<TEXT/>
<DATE V="20160101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<TEXT/>
<ID V="4444"/>
<TEXT/>
<TEXT/>
<DATE V="20160101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<TEXT/>
<TEXT/>
<ID V="22"/>
<TEXT/>
<DATE V="20151231 00:00:00"/>
</BLOCK>

<BLOCK>
<TEXT/>
<ID V="7777777"/>
<TEXT/>
<TEXT/>
<DATE V="20130101 00:00:00"/>
<TEXT/>
</BLOCK>

</TABLES>
</DATA>
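
To see the tie-break rule in isolation, here is a minimal sketch that inlines the same logic and runs it on a made-up three-block input. The function names `dedupe` and `sample`, the shortened array names, and the `NOTE` element are assumptions for illustration only and are not part of the original question or answer.

```shell
# dedupe: hypothetical wrapper that inlines the logic of tst.awk.
dedupe() {
    awk 'BEGIN { RS=""; ORS="\n\n" }
    {
        id = date = $0
        gsub(/.*\n<ID V="|".*/, "", id)
        gsub(/.*\n<DATE V="|".*/, "", date)
    }
    date >= maxDate[id] {       # ">=" lets the later of two equal dates win
        delete keep[winner[id]]
        winner[id]  = NR
        keep[NR]    = $0
        maxDate[id] = date
    }
    END {
        for (n = 1; n <= NR; n++)
            if (n in keep) print keep[n]
    }'
}

# sample: made-up input; ID 7 occurs twice with identical DATE values.
sample() {
    printf '<BLOCK>\n<ID V="7"/>\n<DATE V="20130101 00:00:00"/>\n<NOTE V="first"/>\n</BLOCK>\n\n'
    printf '<BLOCK>\n<ID V="9"/>\n<DATE V="20120101 00:00:00"/>\n</BLOCK>\n\n'
    printf '<BLOCK>\n<ID V="7"/>\n<DATE V="20130101 00:00:00"/>\n<NOTE V="second"/>\n</BLOCK>\n'
}

sample | dedupe
```

With this input the ID 9 block should come out first, followed by the ID 7 block marked "second": the earlier ID 7 block loses the tie, and the survivor keeps its own position in the file. Changing `>=` to `>` would make the first of two equal dates win instead.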

You say "date" in your question, but I'm assuming you really mean whatever is in the DATE field of your input. It doesn't matter for the example you posted, since all the times are midnight, but the script above compares the date plus the time, i.e. the entire contents of the DATE field. If you want the time of day excluded from the comparison, then just change:

    gsub(/.*\n<DATE V="|".*/,"",date)

to:

    gsub(/.*\n<DATE V="| .*/,"",date)
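
The difference between the two patterns can be checked on a single record. The following is a small sketch; the function name `extract_keys`, the variable names `full` and `day`, and the sample record with a non-midnight time are assumptions for illustration.

```shell
# extract_keys: hypothetical helper that prints the comparison key each
# of the two gsub() patterns extracts from one made-up record.
extract_keys() {
    printf '<BLOCK>\n<ID V="22"/>\n<DATE V="20151231 08:30:00"/>\n</BLOCK>\n' |
    awk 'BEGIN { RS="" }
    {
        full = day = $0
        gsub(/.*\n<DATE V="|".*/, "", full)   # keeps date and time
        gsub(/.*\n<DATE V="| .*/, "", day)    # cuts at the first space after the date
        print "full=" full
        print "day=" day
    }'
}

extract_keys
```

This should print `full=20151231 08:30:00` and `day=20151231`. With the second pattern, two blocks that share an ID and a calendar day compare as equal regardless of their times, so the later one in the file wins.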
