Algorithm to Cluster Similar Strings in Python?

Question

I'm working on a script that currently holds multiple lists of DNA sequences (each list has a different number of sequences), and I need to cluster the sequences in each list by Hamming distance. My current implementation (very crude at the moment) takes the first sequence in the list and computes the Hamming distance from it to each subsequent sequence. Any sequence within a certain Hamming distance is appended to a new list, which is later used both to remove those sequences from the original list and to store them in a defaultdict. My current implementation is below:

from collections import defaultdict


def hamming_dist(sequence1, sequence2):
    """
    Calculates the hamming distance between 2 sequences
    """
    assert len(sequence1) == len(sequence2)
    return sum(base1 != base2 for base1, base2 in zip(sequence1, sequence2))


def group_sequences(sequences_list, max_dist=30):
    remaining = list(sequences_list)  # work on a copy, not the caller's list
    clustered_sequence = defaultdict(list)
    while remaining:
        # The first remaining sequence becomes the key of a new cluster.
        main_sequence = remaining[0]
        similar, rest = [], []
        for sequence in remaining:
            # The key has distance 0 to itself, so it always joins its cluster.
            if hamming_dist(main_sequence, sequence) < max_dist:
                similar.append(sequence)
            else:
                rest.append(sequence)
        clustered_sequence[main_sequence].extend(similar)
        remaining = rest
    return clustered_sequence

Obviously this isn't the best or most efficient way to do it, and even with this implementation I still ran into some bugs in my script. I read through StackOverflow/StackExchange questions to see whether other people had encountered the same problem. In the similar questions I found, many people mentioned algorithms such as K-Means, Markov clustering, hierarchical clustering, etc. I'm not too familiar with any of these methods except K-Means, which requires numbers, not strings.

Which clustering method(s) would you suggest for grouping similar DNA sequences together, and what is the best way to implement your preferred method? Any suggestions would be much appreciated!

Recommended answer

The best choice to get started is hierarchical clustering.

It's easy to understand, it allows any distance, and it can visualize the clustering as a tree even when the data itself is hard to visualize.
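
As a rough illustration of the idea, here is a minimal pure-Python sketch of agglomerative (bottom-up) hierarchical clustering on Hamming distance. It is not taken from the answer: the choice of average linkage, the helper names `average_linkage` and `hierarchical_clusters`, the `max_dist` cutoff, and the toy sequences are all assumptions made for the example. It recomputes distances on every pass, so it is only suitable for small lists.

```python
from itertools import combinations


def hamming_dist(seq1, seq2):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(seq1, seq2))


def average_linkage(cluster_a, cluster_b, sequences):
    """Mean pairwise Hamming distance between the members of two clusters.

    Average linkage is an assumption here; single or complete linkage
    would be equally valid choices.
    """
    total = sum(hamming_dist(sequences[i], sequences[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def hierarchical_clusters(sequences, max_dist):
    """Merge the two closest clusters repeatedly, stopping once the
    closest pair is farther apart than max_dist (a hypothetical cutoff)."""
    clusters = [[i] for i in range(len(sequences))]  # start from singletons
    merges = []  # merge history: the tree, read from the leaves upward
    while len(clusters) > 1:
        dist, a, b = min(
            (average_linkage(clusters[a], clusters[b], sequences), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        if dist > max_dist:
            break
        merges.append((clusters[a], clusters[b], dist))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return [[sequences[i] for i in c] for c in clusters], merges


# Toy data, invented for the example.
toy = ["AAAAAAAA", "AAAAAAAT", "AAAAAATT", "GGGGGGGG", "GGGGGGGC"]
groups, history = hierarchical_clusters(toy, max_dist=3)
for group in groups:
    print(group)
```

On this toy input the A-rich and G-rich sequences end up in two separate groups, and `history` records each merge with the distance at which it happened, which is the information a dendrogram displays. For larger inputs, an optimized library routine such as SciPy's `scipy.cluster.hierarchy.linkage` is probably a better fit than this loop.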
