Calculating a similarity/difference matrix from equal length strings in Python


Question

I have pairs of equal-length strings in Python, and an accepted alphabet. Not all of the letters in the strings will come from the accepted alphabet. E.g.

str1 = 'ACGT-N?A'
str2 = 'AAGAA??T'
alphabet = 'ACGT'

What I want to get is a numpy matrix that describes the similarities and differences between the strings. The columns correspond to the accepted letters as they occur in str1, and the rows to the accepted letters as they occur in str2. Each entry counts the positions at which str2 contains the row's letter and str1 contains the column's letter. I only care about positions where both strings have an accepted letter.
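
One way to write this definition down, using notation introduced here for illustration only ($a_1, \dots, a_N$ for the alphabet letters, $M$ for the matrix, $k$ for a position in the strings):

```latex
M_{ij} = \left|\{\, k : \mathrm{str2}[k] = a_i \ \text{and}\ \mathrm{str1}[k] = a_j \,\}\right|
```

A position where either string holds a character outside the alphabet matches no pair $(i, j)$, so it contributes to no entry.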

So, for the example above, my output would be (cols are str1, rows are str2, names are in alphabetical order starting top left):

# cols & rows both refer to 'A', 'C', 'G', 'T' starting top left
# cols are str1, rows are str2

array([[ 1,  1,  0,  1],
       [ 0,  0,  0,  0],
       [ 0,  0,  1,  0],
       [ 1,  0,  0,  0]])

I can brute force this by checking every possible pair of letters, but I'd like to know if anyone has hints (or links) towards a more general solution. E.g. is there a way to define an alphabet of N unique characters and get out an N-by-N matrix given two equal-length input strings?

Brute-force approach:

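import numpy as np

def matrix(s1,s2):
    # Rows follow s2 and columns follow s1, both in the order A, C, G, T.
    m= np.zeros((4,4), dtype=int)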
    for i in range(len(s1)):
        if s1[i]==s2[i]:
            if s1[i]=="A":
                m[0,0]=m[0,0]+1
            elif s1[i]=="C":
                m[1,1]=m[1,1]+1
            elif s1[i]=="G":
                m[2,2]=m[2,2]+1
            elif s1[i]=="T":
                m[3,3]=m[3,3]+1
        elif s1[i]=="A":
            if s2[i]=="C":
                m[1,0]=m[1,0]+1
            elif s2[i]=="G":
                m[2,0]=m[2,0]+1
            elif s2[i]=="T":
                m[3,0]=m[3,0]+1
        elif s1[i]=="C":
            if s2[i]=="A":
                m[0,1]=m[0,1]+1
            elif s2[i]=="G":
                m[2,1]=m[2,1]+1
            elif s2[i]=="T":
                m[3,1]=m[3,1]+1
        elif s1[i]=="G":
            if s2[i]=="A":
                m[0,2]=m[0,2]+1
            elif s2[i]=="C":
                m[1,2]=m[1,2]+1
            elif s2[i]=="T":
                m[3,2]=m[3,2]+1
        elif s1[i]=="T":
            if s2[i]=="C":
                m[1,3]=m[1,3]+1
            elif s2[i]=="G":
                m[2,3]=m[2,3]+1
            elif s2[i]=="A":
                m[0,3]=m[0,3]+1
    return m
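
The branching above can be collapsed by looking each letter up in a table instead of testing it case by case. The following is a minimal sketch of that idea, not part of the original post; the name `count_pairs` and its argument order are assumptions made for illustration. It works for an alphabet of any size, as long as every letter is a single character.

```python
import numpy as np

def count_pairs(alphabet, s1, s2):
    """Count letter pairs position by position (hypothetical helper)."""
    # Map each accepted letter to its row/column index.
    index = {letter: i for i, letter in enumerate(alphabet)}
    n = len(alphabet)
    m = np.zeros((n, n), dtype=int)
    for c1, c2 in zip(s1, s2):
        # Skip positions where either string has a letter outside the alphabet.
        if c1 in index and c2 in index:
            m[index[c2], index[c1]] += 1  # rows follow s2, columns follow s1
    return m

print(count_pairs('ACGT', 'ACGT-N?A', 'AAGAA??T'))
```

This is still a plain Python loop over the string, so it only removes the hard-coded cases and does not vectorize the work.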

Recommended answer

Using dot product of boolean matrices (easiest way to keep the order right):

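import numpy as np

def simMtx(a, x, y):
    # Split the alphabet and both strings into arrays of single characters.
    a = np.array(list(a))
    x = np.array(list(x))
    y = np.array(list(y))
    # One-hot encode each position; characters outside the alphabet give an all-zero row.
    ax = (x[:, None] == a[None, :]).astype(int)
    ay = (y[:, None] == a[None, :]).astype(int)
    # Entry (i, j) counts positions where y has letter i and x has letter j.
    return np.dot(ay.T, ax)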

simMtx(alphabet, str1, str2)
Out[183]: 
array([[1, 1, 0, 1],
       [0, 0, 0, 0],
       [0, 0, 1, 0],
       [1, 0, 0, 0]])
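
The dot product works because each comparison matrix has one row per string position and one column per alphabet letter, so multiplying the transposed encoding of one string by the encoding of the other sums, over all positions, the product of the two indicator values. A position where either character is outside the alphabet has an all-zero row and drops out. If the two intermediate matrices are a concern for long strings, the same counts can be accumulated from index arrays instead. The sketch below is one possible variant and not part of the original answer; the name `sim_mtx_indexed` is hypothetical, it assumes single-character letters with no repeats in the alphabet, and its speed relative to `simMtx` was not measured.

```python
import numpy as np

def sim_mtx_indexed(alphabet, s1, s2):
    """Index-based variant of the pair count (hypothetical name)."""
    a = np.array(list(alphabet))
    x = np.array(list(s1))
    y = np.array(list(s2))
    # Keep only positions where both strings hold an accepted letter.
    keep = np.isin(x, a) & np.isin(y, a)
    # Map each kept character to its index in the alphabet as given.
    order = np.argsort(a)
    col = order[np.searchsorted(a, x[keep], sorter=order)]
    row = order[np.searchsorted(a, y[keep], sorter=order)]
    m = np.zeros((len(a), len(a)), dtype=int)
    # np.add.at is unbuffered, so repeated (row, col) pairs accumulate.
    np.add.at(m, (row, col), 1)
    return m

print(sim_mtx_indexed('ACGT', 'ACGT-N?A', 'AAGAA??T'))
```

Using `np.add.at` rather than `m[row, col] += 1` matters here: the fancy-indexed `+=` would count a repeated pair only once.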
