Algorithm to find closest string using same characters


Problem Description


Given a list L of n character strings and an input character string S, what is an efficient way to find the string in L that contains the most characters that exist in S? We want to find the string in L that is most closely made up of the letters contained in S.

The obvious answer is to loop through all n strings and check how many characters of the current string exist in S. However, this algorithm will be run frequently, and the list L of n strings will be stored in a database... looping manually through all n strings would require something like big-Oh of n*m^2, where n is the number of strings in L and m is the max length of any string in L, as well as the max length of S... in this case m is actually a constant of 150.

Is there a better way than just a simple loop? Is there a data structure I can load the n strings into that would give me fast search ability? Is there an algorithm that uses pre-calculated metadata about each of the n strings that would perform better than a loop?

I know there are a lot of geeks out there who are into algorithms. So please help!

Thanks!

Solution

If you are after substrings, a trie or Patricia trie might be a good starting point.
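For the substring case, here is a minimal trie sketch in Python; the class and method names are illustrative, not part of the original answer.

class TrieNode:
    # One node per character; children are keyed by the next character.
    def __init__(self):
        self.children = {}
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def has_prefix(self, prefix):
        # True if any inserted word starts with the given prefix.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return True

trie = Trie()
for s in ["UUU", "UVW", "WUU"]:
    trie.insert(s)
print(trie.has_prefix("UV"))  # True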

If you don't care about the order, just about the number of each symbol or letter, I would calculate a histogram for each string and then compare them with the histogram of the input.

               ABCDEFGHIJKLMNOPQRSTUVWXYZ
Hello World => ...11..1...3..2..1....1...
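A minimal sketch of such a histogram over the 26 case-insensitive Latin letters; the function name is illustrative.

def histogram(s):
    # Count each Latin letter A-Z in s, ignoring case and other characters.
    counts = [0] * 26
    for ch in s.upper():
        if 'A' <= ch <= 'Z':
            counts[ord(ch) - ord('A')] += 1
    return counts

# Render "Hello World" as in the table above: '.' for a zero count.
print(''.join('.' if c == 0 else str(c) for c in histogram("Hello World")))
# => ...11..1...3..2..1....1...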

If you consider only case-insensitive Latin letters, this lowers the per-query cost to O(m + 26 * n): O(m) to build the histogram of S, plus an O(26) comparison against each of the n stored histograms, with the histograms of L precomputed once.

If m is constant, you could interpret each histogram as a 26-dimensional vector and, by normalizing it, as a point on the 26-dimensional unit sphere. Then you could just calculate the dot product of two such vectors, which yields the cosine of the angle between them, and this value should be proportional to the similarity of the strings.
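A sketch of that idea with plain Python lists; the helper names are illustrative.

import math

def normalize(h):
    # Scale a histogram to Euclidean length 1 (a point on the unit sphere).
    r = math.sqrt(sum(x * x for x in h))
    return [x / r for x in h]

def cosine(h1, h2):
    # Dot product of the normalized histograms = cosine of their angle.
    return sum(a * b for a, b in zip(normalize(h1), normalize(h2)))

print(cosine([1, 0, 0], [1, 1, 0]))  # ~0.707, i.e. a 45-degree angle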

Assuming m = 3, an alphabet A = { 'U', 'V', 'W' } of size three only, and the following list of strings.

L = { "UUU", "UVW", "WUU" }

The histograms are the following.

H = { (3, 0, 0), (1, 1, 1), (2, 0, 1) }

A histogram h = (x, y, z) is normalized to h' = (x/r, y/r, z/r), where r is the Euclidean norm of the histogram h, that is, r = sqrt(x² + y² + z²).

H' = { (1.000, 0.000, 0.000), (0.577, 0.577, 0.577), (0.894, 0.000, 0.447) }

The input S = "VVW" has the histogram hs = (0, 2, 1) and the normalized histogram hs' = (0.000, 0.894, 0.447).

Now we can calculate the similarity of two histograms h1 = (a, b, c) and h2 = (x, y, z) as the Euclidean distance between the two histograms.

d(h1, h2) = sqrt((a - x)² + (b - y)² + (c - z)²)

For the example we obtain:

d((3, 0, 0), (0, 2, 1)) = 3.742
d((1, 1, 1), (0, 2, 1)) = 1.414
d((2, 0, 1), (0, 2, 1)) = 2.828

Hence "UVW" is closest to "VVW" (smaller numbers indicate higher similarity).

Using the normalized histograms h1' = (a', b', c') and h2' = (x', y', z'), we can instead calculate a similarity score as the dot product of the two histograms.

d'(h1', h2') = a'x' + b'y' + c'z'

For the example we obtain:

d'((1.000, 0.000, 0.000), (0.000, 0.894, 0.447)) = 0.000
d'((0.577, 0.577, 0.577), (0.000, 0.894, 0.447)) = 0.774
d'((0.894, 0.000, 0.447), (0.000, 0.894, 0.447)) = 0.200

Again "UVW" is determined to be closest to "VVW" (larger numbers indicate higher similarity).

Both versions yield different numbers, but the results are always the same. One could also use other norms, the Manhattan distance (L1 norm) for example, but this will only change the numbers, because all norms on a finite-dimensional vector space are equivalent.
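For this example, a Manhattan-distance sketch picks the same winner, only with different numbers:

def manhattan(h1, h2):
    # L1 distance between two histograms.
    return sum(abs(a - b) for a, b in zip(h1, h2))

H = {"UUU": (3, 0, 0), "UVW": (1, 1, 1), "WUU": (2, 0, 1)}
hs = (0, 2, 1)  # histogram of S = "VVW"

for word, h in H.items():
    print(word, manhattan(h, hs))  # 6, 2, 4 -> "UVW" wins again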
