Remove lines which are substrings of other lines

Problem description

How can I delete lines which are substrings of other lines in a file, while keeping the longer strings that contain them?

I have a file that contains peptide sequences as strings, one sequence per line. I want to keep only the strings that contain the other sequences and remove every line that is a substring of another line in the file.

Input

GSAAQQYW
ATFYGGSDASGT
GSAAQQYWTPANATFYGGSDASGT
GSAAQQYWTPANATF
ATFYGGSDASGT
NYARTTCRRTG
IVPVNYARTTCRRTGGIRFTITGHDYFDN
RFTITGHDYFDN
IVPVNYARTTCRRTG
ARTTCRRTGGIRFTITG

Expected output

GSAAQQYWTPANATFYGGSDASGT
IVPVNYARTTCRRTGGIRFTITGHDYFDN

The output should keep only the longest strings and remove all lines which are substrings of a longer string. In the input above, lines 1, 2, 4 and 5 are substrings of line 3, so the output retains line 3. Similarly, the strings on lines 6, 8, 9 and 10 are all substrings of line 7, so line 7 is retained and written to the output.

Recommended answer

This should do what you want:

$ cat tst.awk
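# Record each distinct line as an array key and append every line to one long string.
{ arr[$0]; strs=strs $0 RS }
END {
    for (str in arr) {
        # split() returns 2 only when str occurs exactly once in strs.
        if ( split(strs,tmp,str) == 2 ) {
            print str
        }
    }
}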

$ awk -f tst.awk file
IVPVNYARTTCRRTGGIRFTITGHDYFDN
GSAAQQYWTPANATFYGGSDASGT

It loops through every string in arr and uses each one as the separator for split(). If the string occurs only once, the full file contents are split into two pieces and split() returns 2. If the string is also a substring of some other line, the contents are split into more pieces and split() returns a number higher than 2. Note that split() treats a separator longer than one character as a regular expression, which is harmless for plain letter sequences like these, and that for (str in arr) visits keys in an unspecified order, which is why the output order above differs from the expected output.
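To make the counting concrete, here is a minimal sketch of the split() behavior on two of the sample lines. The helper name count_pieces is hypothetical (it is not part of the answer); it only prints the number of pieces split() produces for a given separator.

```shell
# count_pieces is a hypothetical helper, not part of the original answer.
# It joins all input lines into one string and reports how many pieces
# split() yields when the given string is used as the separator.
count_pieces() {
    awk -v sep="$1" '
        { strs = strs $0 RS }
        END { print split(strs, tmp, sep) }
    '
}

# The longer string occurs once, so the contents fall into 2 pieces.
printf 'GSAAQQYW\nGSAAQQYWTPANATF\n' | count_pieces GSAAQQYWTPANATF   # prints 2

# The shorter string also occurs inside the longer one, giving 3 pieces.
printf 'GSAAQQYW\nGSAAQQYWTPANATF\n' | count_pieces GSAAQQYW          # prints 3
```

A count of exactly 2 is therefore the signal that a line is not contained in any other line.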

If a string can appear multiple times in the input and you want it printed that many times in the output (see the question in the comment from @G.Cito below), modify the above to:

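# Count each line, but append only its first occurrence to strs.
!cnt[$0]++ { strs=strs $0 RS }
END {
    for (str in cnt) {
        if ( split(strs,tmp,str) == 2 ) {
            # Print the string as many times as it appeared in the input.
            for (i=1;i<=cnt[str];i++) {
                print str
            }
        }
    }
}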
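As an alternative sketch, not taken from the answer above, the same filtering can be done with index(), which searches for a literal substring rather than a regular expression, and with a numerically indexed array so that the output follows input order. The function name keep_longest is hypothetical. This variant prints each surviving string once, skips empty lines, and compares every pair of distinct lines, so it is quadratic in the number of lines.

```shell
# keep_longest is a hypothetical name; this is a sketch, not the answer's method.
keep_longest() {
    awk '
    # Remember each distinct non-empty line once, in input order.
    length($0) && !seen[$0]++ { lines[++n] = $0 }
    END {
        for (i = 1; i <= n; i++) {
            keep = 1
            for (j = 1; j <= n; j++) {
                # index() is a literal search, so regex metacharacters are harmless.
                if (i != j && index(lines[j], lines[i]) > 0) { keep = 0; break }
            }
            if (keep) print lines[i]
        }
    }' "$@"
}
```

Called as keep_longest file on the sample input, this should print GSAAQQYWTPANATFYGGSDASGT followed by IVPVNYARTTCRRTGGIRFTITGHDYFDN, matching the order of the expected output.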
