Remove item from list based on the next item in same list


Question

I just started learning Python, and I have a sorted list of protein sequences (59,000 sequences in total), some of which overlap. I have made a toy list here as an example:

ABCDE
ABCDEFG
ABCDEFGH
ABCDEFGHIJKLMNO
CEST
DBTSFDE
DBTSFDEO
EOEUDNBNUW
EOEUDNBNUWD
EAEUDNBNUW
FEOEUDNBNUW
FG
FGH

I would like to remove the shorter overlapping sequences and keep only the longest one, so the desired output would look like this:

ABCDEFGHIJKLMNO
CEST
DBTSFDEO
EAEUDNBNUW
FEOEUDNBNUWD
FGH

How can I do it? My code looks like this:

with open('toy.txt' ,'r') as f:
    pattern = f.read().splitlines()
    print pattern

    for i in range(0, len(pattern)):
        if pattern[i] in pattern[i+1]:
            pattern.remove(pattern[i])
        print pattern

I get the following error message:

['ABCDE', 'ABCDEFG', 'ABCDEFGH', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGH', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FGH']
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    if pattern[i] in pattern[i+1]:
IndexError: list index out of range

Recommended answer

There are other working answers, but none of them explains your actual problem. You were actually very close to a valid solution, and to what is, in my opinion, the most readable one.

The error comes from the fact that you were mutating the same list that you were indexing into with range().

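As i increases, you keep removing items, so the list shrinks while the range still counts up to the original length. At some point i + 1 points past the end of the list and the lookup raises an IndexError. Even without the removals, range(0, len(pattern)) lets i reach the last index, where pattern[i+1] does not exist.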
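To make the failure concrete, here is a minimal sketch that reproduces the same pattern on a two-item list. The function name remove_in_place is made up for this illustration, and the code is written for Python 3 rather than the Python 2 used in the question.

```python
def remove_in_place(items):
    """Reproduce the failing pattern: delete from the list being indexed."""
    items = list(items)          # work on a copy so the caller's list is untouched
    for i in range(len(items)):  # the range is computed once, from the original length
        if items[i] in items[i + 1]:
            items.remove(items[i])  # the list is now shorter than the range expects
    return items


raised = False
try:
    remove_in_place(["FG", "FGH"])
except IndexError:
    # After "FG" is removed the list holds one item, yet the loop still asks for index 1.
    raised = True

print("IndexError raised:", raised)
```

Removing items also shifts everything after them one position to the left, which is why 'ABCDEFG' survives in the question's output even though 'ABCDEFGH' contains it.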

Here is a working version of your initial code, with a few changes:

pattern = ["ABCDE", "ABCDEFG", "ABCDEFGH", "ABCDEFGHIJKLMNO", "CEST", "DBTSFDE", "DBTSFDEO", "EOEUDNBNUW", "EAEUDNBNUW", "FG", "FGH"]
output_pattern = []

# Compare each item with the one after it; stop before the last item.
for i in range(len(pattern) - 1):
    if pattern[i] not in pattern[i + 1]:
        output_pattern.append(pattern[i])

# Adding the last item
output_pattern.append(pattern[-1])
print(output_pattern)

# Output:
# ['ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FGH']

Note that this code works only if your list is already sorted, as you mentioned in the comment section.

What is this code doing?

Basically, it uses the same logic as your initial attempt: it iterates over the list and checks whether the next item contains the current item. However, collecting the results in a second list and iterating only up to the second-to-last item fixes your index problem. That raises one question:

How do I handle the last item?

Since the list is sorted, no later item can contain the last one, so you can consider the last item as always being unique. This is why I'm using

output_pattern.append(pattern[-1])

which adds the last item of the initial list.
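The same pairing of each item with its successor can also be sketched with zip, which avoids index arithmetic altogether. This is only an illustration under a few assumptions: the function name drop_items_contained_in_next is hypothetical, the input is sorted inside the function so that unsorted lists are handled too, and plain lexicographic order is taken to be the order the question means by "sorted".

```python
def drop_items_contained_in_next(sequences):
    """Keep each item unless the item after it (in sorted order) contains it."""
    ordered = sorted(sequences)  # assumption: plain lexicographic order is wanted
    if not ordered:
        return []  # nothing to compare, and no last item to append
    # zip stops at the shorter input, so the last item is never the "current" one.
    kept = [cur for cur, nxt in zip(ordered, ordered[1:]) if cur not in nxt]
    kept.append(ordered[-1])  # nothing follows the last item, so it is always kept
    return kept


pattern = ["FGH", "ABCDE", "CEST", "ABCDEFG", "FG"]
result = drop_items_contained_in_next(pattern)
print(result)  # ['ABCDEFG', 'CEST', 'FGH']
```

Because the function builds a new list instead of deleting from the one it walks over, the IndexError from the question cannot occur here.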

Important note

This answer was written in response to the OP's initial question, where he wanted to keep the longer overlap, and I quote, "based on the next item in same list". As stated by @Chris_Rands, if your concern is a biological task and you need to find any overlap, this solution is not suited to your needs.

Here is an example where this code would fail to recognize a potential overlap:

pattern = ["ACD", "AD", "BACD"]

For this input the code outputs the list unchanged, without removing "ACD" even though "BACD" contains it. Handling that case implies a much more complex algorithm, which I initially thought was outside the scope of the question. I may be completely wrong here, but if this is your case, I truly think a C++ implementation would be more appropriate. Have a look at the CD-Hit algorithm suggested by @Chris_Rands in the comment section.
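For small inputs, the "contained anywhere" case can still be sketched in plain Python with a brute-force check. This is not CD-Hit and not part of the original answer: it only removes sequences that are exact substrings of a longer sequence, the function name drop_contained_anywhere is made up for the example, and the number of comparisons grows quadratically in the worst case, so it may be too slow for 59,000 sequences.

```python
def drop_contained_anywhere(sequences):
    """Drop every sequence that is an exact substring of a longer kept sequence."""
    # Longest first, so anything that could contain a sequence is examined before it.
    unique = sorted(set(sequences), key=len, reverse=True)
    kept = []
    for seq in unique:
        # Checking against kept items is enough: a discarded item is itself
        # contained in some kept item, and containment is transitive.
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return sorted(kept)


result = drop_contained_anywhere(["ACD", "AD", "BACD"])
print(result)  # ['AD', 'BACD']
```

Here "ACD" is dropped because "BACD" contains it, while "AD" stays because it is not a contiguous part of either other sequence.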
