比较相似度算法 [英] Compare similarity algorithms

查看:188
本文介绍了比较相似度算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想使用字符串相似性函数在数据库中查找损坏的数据.

I want to use string similarity functions to find corrupted data in my database.

我遇到了其中几个:

  • Jaro,
  • Jaro-Winkler,
  • Levenshtein,
  • 欧几里得和
  • Q-gram

我想知道它们之间有什么区别,以及它们在什么情况下最有效?

I wanted to know what is the difference between them and in what situations they work best?

推荐答案

在勘误表和

Expanding on my wiki-walk comment in the errata and noting some of the ground-floor literature on the comparability of algorithms that apply to similar problem spaces, let's explore the applicability of these algorithms before we determine if they're numerically comparable.

从Wikipedia, Jaro-Winkler :

From Wikipedia, Jaro-Winkler:

在计算机科学和统计学中,Jaro–Winkler距离 (Winkler,1990)是衡量两个字符串之间相似性的标准.它是 Jaro距离度量的一种变体(Jaro,1989,1995),以及 主要[需要引用]用于记录链接领域(重复 检测).两根琴弦的Jaro–Winkler距离越高, 字符串越相似. Jaro–Winkler距离度量为 设计并最适合诸如人名之类的短字符串.这 分数被标准化,以使0等于无相似性,而1为 完全匹配.

In computer science and statistics, the Jaro–Winkler distance (Winkler, 1990) is a measure of similarity between two strings. It is a variant of the Jaro distance metric (Jaro, 1989, 1995) and mainly[citation needed] used in the area of record linkage (duplicate detection). The higher the Jaro–Winkler distance for two strings is, the more similar the strings are. The Jaro–Winkler distance metric is designed and best suited for short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match.

Levenshtein距离:

在信息论和计算机科学中,Levenshtein距离 是用于衡量两个之间的差异量的字符串量度 序列.术语编辑距离"通常用于专门指代 到Levenshtein的距离.

In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance.

两个琴弦之间的Levenshtein距离定义为最小 将一个字符串转换为另一个字符串所需的编辑次数,其中 允许的编辑操作是插入,删除或 替换单个字符.它以弗拉基米尔·弗拉基米尔(Vladimir)命名 莱文施泰因(Levenshtein),他在1965年考虑了这一距离.

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.

欧几里德距离:

在数学中,欧几里德距离或欧几里德度量是 两点之间的普通"距离,用一个点来测量 尺,并由毕达哥拉斯公式给出.通过使用这个公式 随着距离的增加,欧几里德空间(甚至任何内积空间)变为 公制空间.关联的规范称为欧几里得规范. 较早的文献将度量标准称为毕达哥拉斯度量标准.

In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" distance between two points that one would measure with a ruler, and is given by the Pythagorean formula. By using this formula as distance, Euclidean space (or even any inner product space) becomes a metric space. The associated norm is called the Euclidean norm. Older literature refers to the metric as Pythagorean metric.

Q或n-gram编码:

在计算语言学和概率领域,一个n-gram 是给定文本序列中n个项目的连续序列,或者 演讲.有问题的项目可以是音素,音节,字母, 单词或碱基对根据应用程序. n-克是 从文本或语音语料库中收集.

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items in question can be phonemes, syllables, letters, words or base pairs according to the application. n-grams are collected from a text or speech corpus.

两个核心 n元语法模型的优点(以及使用 它们)是相对简单的,并且具有扩展的能力-只需 增加一个模型可以用来存储更多的上下文 易于理解的时空权衡,可以进行小型实验 非常有效地扩展.

The two core advantages of n-gram models (and algorithms that use them) are relative simplicity and the ability to scale up – by simply increasing n a model can be used to store more context with a well-understood space–time tradeoff, enabling small experiments to scale up very efficiently.

问题在于这些算法解决了所有可能解决算法数据中或嫁接可用的度量标准.实际上,并非所有这些指标甚至都是 metrics ,因为其中一些指标不能满足

The trouble is these algorithms solve different problems that have different applicability within the space of all possible algorithms to solve the longest common subsequence problem, in your data or in grafting a usable metric thereof. In fact, not all of these are even metrics, as some of them don't satisfy the triangle inequality.

正确地做到这一点:使用奇偶校验位作为数据.当更简单的解决方案可以解决时,不要尝试解决更困难的问题.

Instead of going out of your way to define a dubious scheme to detect data corruption, do this properly: by using checksums and parity bits for your data. Don't try to solve a much harder problem when a simpler solution will do.

这篇关于比较相似度算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆