High performance fuzzy string comparison in Python, use Levenshtein or difflib


Question

I am doing clinical message normalization (spell check), in which I check each given word against a 900,000-word medical dictionary. I am more concerned about the time complexity/performance.

I want to do fuzzy string comparison, but I'm not sure which library to use.

Option 1:

import Levenshtein
Levenshtein.ratio('hello world', 'hello')

Result: 0.625

Option 2:

import difflib
difflib.SequenceMatcher(None, 'hello world', 'hello').ratio()

Result: 0.625

In this example both give the same answer. Do you think both perform alike in this case?
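For context on why both calls return 0.625 here: `SequenceMatcher.ratio()` is defined as `2*M / T`, where `M` is the total size of the matched blocks and `T` is the combined length of both strings. A small sketch verifying that for this pair (the variable names are illustrative):

```python
import difflib

a, b = "hello world", "hello"
matcher = difflib.SequenceMatcher(None, a, b)

# ratio() is 2*M / T: M = total size of matched blocks, T = len(a) + len(b).
matched = sum(block.size for block in matcher.get_matching_blocks())
ratio = 2 * matched / (len(a) + len(b))

print(matched, ratio)  # 5 matched characters -> 2*5 / 16 = 0.625
```

`Levenshtein.ratio` computes a normalized edit distance that happens to coincide on this pair; the two measures do not always agree, since they are defined differently.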

Answer

In case you're interested in a quick visual comparison of Levenshtein and Difflib similarity, I calculated both for ~2.3 million book titles:

import codecs, difflib, Levenshtein, distance  # Levenshtein and distance are third-party packages

with codecs.open("titles.tsv","r","utf-8") as f:
    title_list = f.read().split("\n")[:-1]

    for row in title_list:

        sr      = row.lower().split("\t")

        diffl   = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
        lev     = Levenshtein.ratio(sr[3], sr[4]) 
        sor     = 1 - distance.sorensen(sr[3], sr[4])
        jac     = 1 - distance.jaccard(sr[3], sr[4])

        print(diffl, lev, sor, jac)

Then I plotted the results with R:

Strictly for the curious, I also compared the Difflib, Levenshtein, Sørensen, and Jaccard similarity values:

library(ggplot2)
require(GGally)

difflib <- read.table("similarity_measures.txt", sep = " ")
colnames(difflib) <- c("difflib", "levenshtein", "sorensen", "jaccard")

ggpairs(difflib)

Result:

The Difflib / Levenshtein similarity really is quite interesting.
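Since the original question is about performance: the C-implemented python-Levenshtein package is generally much faster than pure-Python difflib for single-pair ratios. A rough micro-benchmark sketch (the Levenshtein timing is guarded because the package is third-party; the test pair and iteration count are arbitrary choices):

```python
import timeit
import difflib

pair = ("hello world", "hello word")

# Time 10,000 difflib ratio computations on the same pair.
t_difflib = timeit.timeit(
    lambda: difflib.SequenceMatcher(None, *pair).ratio(), number=10_000)
print(f"difflib:     {t_difflib:.3f}s for 10k calls")

try:
    import Levenshtein  # third-party: pip install python-Levenshtein
    t_lev = timeit.timeit(
        lambda: Levenshtein.ratio(*pair), number=10_000)
    print(f"Levenshtein: {t_lev:.3f}s for 10k calls")
except ImportError:
    print("python-Levenshtein not installed; skipping its timing")
```

One difflib-specific optimization when checking one word against many dictionary entries: set the fixed word with `set_seq2()` and vary only `set_seq1()`, since `SequenceMatcher` caches detailed information about the second sequence.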

2018 edit: If you're working on identifying similar strings, you could also check out minhashing; there's a great overview here. Minhashing is amazing at finding similarities in large text collections in linear time. My lab put together an app that detects and visualizes text reuse using minhashing here: https://github.com/YaleDHLab/intertext
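As a rough illustration of the minhash idea mentioned above (a from-scratch sketch, not the implementation the linked app uses; the function names and the choice of character 3-grams are my own assumptions):

```python
import hashlib

def shingles(s, k=3):
    # Character k-grams of the string; a string shorter than k
    # yields itself as a single shingle.
    return {s[i:i + k] for i in range(max(len(s) - k + 1, 1))}

def minhash_signature(s, num_hashes=64):
    # The minimum of each salted hash over the shingle set approximates
    # the minimum under a random permutation of the shingle universe.
    sh = shingles(s)
    return [
        min(int(hashlib.md5(f"{seed}:{g}".encode()).hexdigest(), 16) for g in sh)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    # The fraction of matching signature positions is an unbiased
    # estimate of the Jaccard similarity of the two shingle sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

The practical payoff is that fixed-length signatures can be bucketed (locality-sensitive hashing), so near-duplicates are found without comparing every pair, which is what makes the approach scale to large collections.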

