String comparison in Python, but not Levenshtein distance (I think)


Question

I found a crude string comparison in a paper I am reading, done as follows:

The equation they use is given below (extracted from the paper, with small wording changes to make it more general and readable). I have tried to explain it a bit more in my own words, using an example from the authors, since their description is not very clear.

For example, for the two sequences ABCDE and BCEFA there are two possible graphs:

graph 1) which connects B with B, C with C and E with E

graph 2) which connects A with A

I cannot connect A with A while I am connecting the other three (graph 1), since that would produce crossing lines (imagine drawing lines between B-B, C-C and E-E); that is, the line linking A-A would cross the lines linking B-B, C-C and E-E. So these two sequences result in two possible graphs: one has three connections (B-B, C-C and E-E) and the other only one (A-A). I then calculate the score d as given by the equation below.
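The "no crossing lines" rule in this example can be sketched in Python. This is only an illustration of my reading of the example, not the authors' procedure: the function names (`identity_links`, `is_non_crossing`, `non_crossing_configurations`, `maximal_configurations`) are hypothetical, and I am assuming that a link is a pair of positions holding the same character and that a configuration is valid when its links keep the same left-to-right order in both strings.

```python
from itertools import combinations


def identity_links(s, t):
    """Return every (i, j) with s[i] == t[j], i.e. every possible link."""
    return [(i, j) for i, a in enumerate(s) for j, b in enumerate(t) if a == b]


def is_non_crossing(links):
    """Assumed rule: links do not cross if, sorted by position in s,
    their positions in t are strictly increasing as well."""
    ordered = sorted(links)
    return all(
        i1 < i2 and j1 < j2
        for (i1, j1), (i2, j2) in zip(ordered, ordered[1:])
    )


def non_crossing_configurations(s, t):
    """Enumerate every non-empty subset of links without crossings.
    Brute force, which is fine for five-character strings."""
    links = identity_links(s, t)
    return [
        subset
        for size in range(1, len(links) + 1)
        for subset in combinations(links, size)
        if is_non_crossing(subset)
    ]


def maximal_configurations(s, t):
    """Keep only configurations that are not contained in a larger one."""
    configs = [set(c) for c in non_crossing_configurations(s, t)]
    return [c for c in configs if not any(c < other for other in configs)]


for config in maximal_configurations("ABCDE", "BCEFA"):
    print(sorted(config))
# [(0, 4)]                  -> A-A
# [(1, 0), (2, 1), (4, 2)]  -> B-B, C-C, E-E
```

Under these assumptions the two maximal configurations are exactly the two graphs described above. The smaller configurations (single links, pairs such as B-B with C-C) are also returned by `non_crossing_configurations`, since the paper calls any part of the graph a configuration.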


Consequently, to define the degree of similarity between two penta-strings we calculate the distance d between them. Aligning the two penta-strings, we look for all the identities between their characters, wherever these may be located. If each identity is represented by a link between both penta-strings, we define a graph for this pair. We call any part of this graph a configuration.

Next, we retain all of those configurations in which there is no character cross pairing (the meaning is explained in my example above, i.e., no crossings of links between identical characters, and only those graphs are retained). Each of these is then evaluated as a function of the number p of characters related to the graph, the shifting Δi for the corresponding pairs and the gap δij between connected characters of each penta-string. The minimum value is chosen as characteristic and is called distance d: d = Min(50 - 10p + ΣΔi + Σδij). Although very rough, this measure is generally in good agreement with the qualitative eye-guided estimation. For instance, the distance between abcde and abcfg is 20, whereas that between abcde and abfcg is 23 = (50 - 30 + 1 + 2).
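The arithmetic of the formula itself can be checked with a few lines of Python. This sketch only reproduces the two numbers quoted from the paper; it does not settle how Δi and δij should be measured from the strings, which is the unclear part. The function names `configuration_score` and `distance` are hypothetical, and the shift and gap sums are passed in by hand using the values the paper states (1 and 2 for abcde / abfcg).

```python
def configuration_score(p, shift_sum, gap_sum):
    """Score of one configuration: 50 - 10p + sum of shifts + sum of gaps.

    p         -- number of linked characters in the configuration
    shift_sum -- sum of the shifts (the paper's ΣΔi), supplied by the caller
    gap_sum   -- sum of the gaps (the paper's Σδij), supplied by the caller
    """
    return 50 - 10 * p + shift_sum + gap_sum


def distance(scores):
    """The distance d is the minimum score over all retained configurations."""
    return min(scores)


# abcde vs abcfg: a-a, b-b, c-c are linked with no shift and no gap.
print(configuration_score(3, 0, 0))  # 20

# abcde vs abfcg: the same three links, with the sums quoted in the paper.
full = configuration_score(3, 1, 2)
# Assumed alternative: link only a-a and b-b, which are aligned and adjacent.
partial = configuration_score(2, 0, 0)
print(distance([full, partial]))  # 23
```

The second example shows why the minimum matters: dropping the shifted c-c link removes the penalties but loses 10 points for the missing link, so the three-link configuration still gives the smaller value.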

I am confused as to how to go about doing this. Any suggestions would be much appreciated.

I tried the Levenshtein distance and also simple sequence alignment as used in protein sequence comparison. The link to the paper is: http://peds.oxfordjournals.org/content/16/2/103.long

I could not find any information on the first author, Alain Figureau, and my emails to M.A. Soto have not been answered (as of today).

Thanks

Recommended answer

Just after the text block you cite, there is a reference to a previous paper by the same authors: Secondary Structure of Proteins and Three-dimensional Pattern Recognition. If the distance is not explained anywhere else, I think it is worth looking into (I'm not at work, so I don't have access to the full document).

Otherwise, you can also try contacting the authors directly. Alain Figureau seems to be an old-school French researcher with no contact details whatsoever (no webpage, no e-mail, no "social networking"), so I advise trying to contact M.A. Soto, whose e-mail is given at the end of the paper. I think they will give you the answer you're looking for: an experiment's procedure has to be crystal clear in order to be repeatable, which is part of the scientific method in research.
