Calculating a similarity/difference matrix from equal-length strings in Python

Views: 36 Published: 2021/8/31 18:33:58 Tags: string python-3.x numpy matrix string-comparison


Problem description

I have pairs of equal-length strings in Python and an accepted alphabet. Not every character in the strings comes from the accepted alphabet. For example:

str1 = 'ACGT-N?A'
str2 = 'AAGAA??T'
alphabet = 'ACGT'

What I want is a NumPy matrix that describes the similarities and differences between the two strings. The columns of the matrix correspond to the accepted letters as they occur in str1, and the rows to the accepted letters as they occur in str2. Each entry counts the positions at which str1 has the column's letter and str2 has the row's letter. I only care about positions where both strings have an accepted letter.

So, for the example above, my output would be (columns are str1, rows are str2, letters in alphabetical order starting from the top left):

# cols & rows both refer to 'A', 'C', 'G', 'T' starting top left
# cols are str1, rows are str2

array([[ 1,  1,  0,  1],
       [ 0,  0,  0,  0],
       [ 0,  0,  1,  0],
       [ 1,  0,  0,  0]])
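
To make the counting rule concrete, here is a small sketch that lists which positions of the example contribute to the matrix. The name `kept` is introduced only for illustration; it is not part of the original question.

```python
str1 = 'ACGT-N?A'
str2 = 'AAGAA??T'
alphabet = 'ACGT'

# Keep only positions where BOTH characters are in the accepted alphabet.
# `kept` is a hypothetical helper name used for this illustration.
kept = [(i, a, b) for i, (a, b) in enumerate(zip(str1, str2))
        if a in alphabet and b in alphabet]

for i, a, b in kept:
    # a comes from str1 (selects the column), b comes from str2 (selects the row)
    print(f"position {i}: column {a!r}, row {b!r}")
```

Five of the eight positions survive the filter (0, 1, 2, 3 and 7), which matches the five nonzero counts in the array above.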

I can brute-force this by checking every possible pair of letters, but I'd like to know if anyone has hints (or links) towards a more general solution. For example, is there a way to define an alphabet of N unique characters and get an N-by-N matrix from two equal-length input strings?

Brute-force approach:

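import numpy as np

def matrix(s1,s2):
    # 4x4 integer counts; rows index s2, columns index s1, both in 'ACGT' order
    m= np.zeros((4,4), dtype=int)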
    for i in range(len(s1)):
        if s1[i]==s2[i]:
            if s1[i]=="A":
                m[0,0]=m[0,0]+1
            elif s1[i]=="C":
                m[1,1]=m[1,1]+1
            elif s1[i]=="G":
                m[2,2]=m[2,2]+1
            elif s1[i]=="T":
                m[3,3]=m[3,3]+1
        elif s1[i]=="A":
            if s2[i]=="C":
                m[1,0]=m[1,0]+1
            elif s2[i]=="G":
                m[2,0]=m[2,0]+1
            elif s2[i]=="T":
                m[3,0]=m[3,0]+1
        elif s1[i]=="C":
            if s2[i]=="A":
                m[0,1]=m[0,1]+1
            elif s2[i]=="G":
                m[2,1]=m[2,1]+1
            elif s2[i]=="T":
                m[3,1]=m[3,1]+1
        elif s1[i]=="G":
            if s2[i]=="A":
                m[0,2]=m[0,2]+1
            elif s2[i]=="C":
                m[1,2]=m[1,2]+1
            elif s2[i]=="T":
                m[3,2]=m[3,2]+1
        elif s1[i]=="T":
            if s2[i]=="C":
                m[1,3]=m[1,3]+1
            elif s2[i]=="G":
                m[2,3]=m[2,3]+1
            elif s2[i]=="A":
                m[0,3]=m[0,3]+1
    return m
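
One possible way to generalize the brute-force version to an alphabet of N characters is sketched below. It is not taken from an answer to this question; the function name `count_matrix` and its signature are assumptions made for illustration. The idea is to map each character to its index in the alphabet (or -1 if it is not accepted), drop positions where either index is -1, and accumulate counts with `np.add.at`.

```python
import numpy as np

def count_matrix(s1, s2, alphabet):
    """Count letter pairs; columns index s1, rows index s2.

    Hypothetical helper, not from the original question.
    """
    if len(s1) != len(s2):
        raise ValueError("strings must have equal length")
    # Position of each accepted letter, in the order given by `alphabet`.
    index = {letter: k for k, letter in enumerate(alphabet)}
    n = len(alphabet)
    # Map every character to its alphabet index, or -1 if it is not accepted.
    cols = np.array([index.get(ch, -1) for ch in s1], dtype=int)
    rows = np.array([index.get(ch, -1) for ch in s2], dtype=int)
    # Keep only positions where both strings hold an accepted letter.
    keep = (cols >= 0) & (rows >= 0)
    m = np.zeros((n, n), dtype=int)
    # np.add.at accumulates correctly when the same (row, col) pair repeats.
    np.add.at(m, (rows[keep], cols[keep]), 1)
    return m

m = count_matrix('ACGT-N?A', 'AAGAA??T', 'ACGT')
print(m)
```

For the example strings this should reproduce the 4-by-4 array shown earlier. Rows and columns follow the order of the `alphabet` argument, so pass a sorted alphabet if alphabetical order is wanted.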
