在拉丁文脚本中匹配包含少于10个单词的两个字符串的最佳算法是什么 [英] What is the best algorithm for matching two string containing less than 10 words in latin script

查看:276
本文介绍了在拉丁文脚本中匹配包含少于10个单词的两个字符串的最佳算法是什么的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在比较歌曲标题,使用拉丁文字(虽然并非总是如此),我的目标是如果两首歌曲标题看起来是相同的标题并且如果他们什么都没有则分数非常低的话会给出高分的算法共同。

I'm comparing song titles, using Latin script (although not always), my aim is an algorithm that gives a high score if the two song titles seem to be the same same title and a very low score if they have nothing in common.

现在我已经不得不使用Lucene和一个RAMDirectory来编写代码(Java) - 但是使用Lucene只是为了比较两个字符串太重,因此也是如此慢。我现在已经开始使用 https://github.com/nickmancol/simmetrics ,它有许多不错的算法用于比较两个字符串:

Now I already had to code (Java) to write this using Lucene and a RAMDirectory - however using Lucene simply to compare two strings is too heavyweight and consequently too slow. I've now moved to using https://github.com/nickmancol/simmetrics which has many nice algorithms for comparing two strings:

https://github.com/nickmancol/simmetrics/tree/master/src/main/java/uk/ac/shef/wit/simmetrics/similaritymetrics

BlockDistance
ChapmanLengthDeviation
ChapmanMatchingSoundex
ChapmanMeanLength
ChapmanOrderedNameCompoundSimilarity
CosineSimilarity
DiceSimilarity
EuclideanDistance
InterfaceStringMetric
JaccardSimilarity
Jaro
JaroWinkler
Levenshtein
MatchingCoefficient
MongeElkan
NeedlemanWunch
OverlapCoefficient
QGramsDistance
SmithWaterman
SmithWatermanGotoh
SmithWatermanGotohWindowedAffine
Soundex

但我不熟悉这些算法,那会是一个不错的选择?

but I'm not well versed in these algorithms and what would be a good choice ?

我认为Lucene以某种形式使用CosineSimilarity,所以这是我的出发点,但我认为可能有更好的东西。

I think Lucene uses CosineSimilarity in some form, so that is my starting point but I think there might be something better.

具体来说,算法应该适用于短字符串和应该理解单词的概念,即空格应该特别对待。拉丁文脚本的良好匹配是最重要的,但是韩文和中文等其他脚本的良好匹配也是相关的,但我希望因为它们处理空格的方式需要不同的算法。

Specifically, the algorithm should work on short strings and should understand the concept of words, i.e spaces should be treated specially. Good matching of Latin script is most important, but good matching of other scripts such as Korean and Chinese is relevant as well but I expect would need different algorithm because of the way they treat spaces.

推荐答案

他们都很好。它们处理字符串的不同属性,并具有不同的匹配属性。什么最适合你取决于你需要什么。

They're all good. They work on different properties of strings and have different matching properties. What works best for you depends on what you need.

我正在使用JaccardSimilarity匹配名称。我之所以选择JaccardSimilarity,是因为它速度相当快,而且短小的字符串在匹配名称方面优于常见的拼写错误而快速降低其他任何分数。给空间增加额外的重量。它对词序也不敏感。我需要这种行为,因为假阳性的影响远远高于假阴性,空间可能是拼写错误,但不常见,而且词序并不那么重要。

I'm using the JaccardSimilarity to match names. I chose the JaccardSimilarity because it was reasonably fast and for short strings excelled in matching names with common typo's while quickly degrading the score for anything else. Gives extra weight to spaces. It is also insensitive to word order. I needed this behavior because the impact of a false positive was much much higher then that off a false negative, spaces could be typos but not often and word order was not that important.

请注意,这是与删除非变音符号的简化器和将剩余字符映射到a-z范围的映射器组合完成的。这通过标准化来标准化所有单词分隔符号到单个空格。最后,解析名称以选择首字母,前内部和后缀。这是因为名称具有对它们的结构和格式,而不仅仅是字符串比较。

Note that this was done in combination with a simplifier that removes non-diacritics and a mapper that maps the remaining characters to the a-z range. This is passed through a normalizes that standardizes all word separator symbols to a single space. Finally the names are parsed to pick out initials, pre- inner- and suffixes. This because names have a structure and format to them that is rather resistant to just string comparison.

要做出选择,您需要列出您想要的标准,然后寻找满足这些标准的算法。你也可以制作一个相当大的测试集并在该测试集上运行所有算法,看看在时间,正数,误报,漏报和否定,系统应该处理的错误类别方面的权衡取舍,等等,

To make your choice you need to make a list of what criteria you want and then look for an algorithm that satisfied those criteria. You can also make a reasonably large test set and run all algorithms on that test set too see what the trade offs are with respect to time, number of positives, false positives, false negatives and negatives, the classes of errors your system should handle, ect, ect.

如果您仍然不确定自己的选择,还可以设置系统以在运行时切换精确的比较算法。这允许您进行A-B测试,看看哪种算法在实践中效果最好。

If you are still unsure of your choice, you can also setup your system to switch the exact comparison algorithms at run time. This allows you to do an A-B test and see which algorithm works best in practice.

TLDR;你想要哪种算法取决于你需要什么,如果你不知道你需要什么,请确保你以后可以改变它并在运行中运行测试。

TLDR; which algorithm you want depends on what you need, if you don't know what you need make sure you can change it later on and run tests on the fly.

这篇关于在拉丁文脚本中匹配包含少于10个单词的两个字符串的最佳算法是什么的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆