How to find a similar substring inside a large string with a similarity score in Python?
Problem description
What I'm looking for is not just a plain similarity score between two texts, but a similarity score for a substring inside a larger string. Say:
text1 = 'cat is sleeping on the mat'
text2 = 'The cat is sleeping on the red mat in the living room'
In the above example, all the words of text1 are present in text2, hence the similarity should be 100%.
If some words of text1 are missing, the score should be lower.
I'm working with a large dataset of paragraphs of varying size, so finding a smaller paragraph inside a bigger one with such a similarity score is crucial.
I found only string similarity measures such as cosine similarity and difflib similarity, which compare two whole strings. I found nothing that scores a substring inside another string.
Recommended answer
Based on your description, how about:
>>> a = "cat is sleeping on the mat"
>>> b = "the cat is sleeping on the red mat in the living room"
>>> a = a.split(" ")
>>> score = 0.0
>>> for word in a:  # for every word in the smaller string
...     if word in b:  # if it occurs in the bigger string, increase the score
...         score += 1
...
>>> score / len(a)  # fraction of the words that were found
1.0
For example, if a word is missing:
>>> c = "the cat is not sleeping on the mat"
>>> c = c.split(" ")
>>> score = 0.0
>>> for w in c:
...     if w in b:  # "not" does not occur in b, so it is not counted
...         score += 1
...
>>> score / len(c)
0.875
Additionally, you can do as @roadrunner suggests: split b and store it as a set with b = set(b.split(" ")). Each membership test then takes O(1) time on average instead of a scan of the whole string, which brings the overall algorithm down to O(n) in the number of words being checked. It also makes the test match whole words only, whereas "word in b" on a plain string matches any substring.
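The set-based approach suggested above can be wrapped in a small function. This is only a minimal sketch: the name word_overlap_score and the decision to lowercase both inputs are assumptions made for illustration, not part of the original answer.

```python
def word_overlap_score(small: str, large: str) -> float:
    """Fraction of the words in `small` that occur as whole words in `large`."""
    words = small.lower().split()
    if not words:
        return 0.0  # assumption: an empty query scores 0 rather than raising
    # Build the set once so each membership test is O(1) on average.
    vocabulary = set(large.lower().split())
    hits = sum(1 for word in words if word in vocabulary)
    return hits / len(words)


text1 = "cat is sleeping on the mat"
text2 = "The cat is sleeping on the red mat in the living room"
print(word_overlap_score(text1, text2))  # 1.0
print(word_overlap_score("the cat is not sleeping on the mat", text2))  # 0.875
```

Because the lookup is done against a set of words, a fragment such as "at" is not counted as present just because "cat" appears in the larger text. Word order and punctuation are ignored in this sketch.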
You say you have already tried metrics such as cosine similarity. However, I suspect you may also benefit from looking at Levenshtein distance, which could be useful here in addition to the solutions above.
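One way the Levenshtein idea could be applied to the substring problem is sketched below. It is a character-level edit distance in which the first row of the table is all zeros, so the match may begin anywhere in the larger text. The function names and the scoring formula (1 minus distance divided by pattern length) are assumptions for illustration and were not given in the answer.

```python
def best_substring_distance(pattern: str, text: str) -> int:
    """Smallest Levenshtein distance between `pattern` and any substring of `text`."""
    # previous[j] holds the best distance between the pattern prefix seen so far
    # and some substring of `text` ending at position j.
    # The first row is all zeros because a match may start anywhere in `text`.
    previous = [0] * (len(text) + 1)
    for i, p_char in enumerate(pattern, start=1):
        current = [i]  # pattern[:i] against an empty substring costs i edits
        for j, t_char in enumerate(text, start=1):
            cost = 0 if p_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # drop p_char from the pattern
                current[j - 1] + 1,      # consume an extra character of text
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    # The match may also end anywhere, so take the minimum over the last row.
    return min(previous)


def fuzzy_substring_score(pattern: str, text: str) -> float:
    """Hypothetical score in [0, 1]: 1.0 means `pattern` occurs verbatim."""
    if not pattern:
        return 1.0
    distance = best_substring_distance(pattern.lower(), text.lower())
    return 1.0 - distance / len(pattern)


text2 = "The cat is sleeping on the red mat in the living room"
print(fuzzy_substring_score("cat is sleeping", text2))  # 1.0
print(best_substring_distance("cat is sleping", text2.lower()))  # 1
```

Unlike the word-counting approach, this tolerates typos and respects word order, but it runs in time proportional to len(pattern) * len(text), so for a large dataset it may be worth filtering candidates with the cheaper word-overlap score first.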