Algorithm to find the most common substrings in a string

Question

Is there any algorithm that can be used to find the most common phrases (or substrings) in a string? For example, the following string would have "hello world" as its most common two-word phrase:

hello world, this is hello world. hello world is repeated three times in this string!

In the string above, the most common substring (after the empty string, which occurs an infinite number of times) would be the space character.

Is there any way to generate a list of the common substrings in this string, from most common to least common?

Recommended answer

This task is similar to the Nussinov algorithm, and is actually even simpler, since we do not allow any gaps, insertions, or mismatches in the alignment.

For a string A of length N (indexed from 0), define a table F[-1 .. N-1, -1 .. N-1] and fill it in using the following rules:

  // row -1 and column -1 stay zero and serve as the border
  for i = 0 to N - 1
    for j = 0 to N - 1
      if i != j
        {
          if A[i] == A[j]
            F[i,j] = F[i-1,j-1] + 1;
          else
            F[i,j] = 0;
        }

For example, for the string BA 0 BA B (the filled-in table is not reproduced here).

This runs in O(n^2) time. The largest values in the table now point to the end positions of the longest self-matching substrings (i is the end of one occurrence, j the end of the other). The array is assumed to be zero-initialized at the start. I have added a condition to exclude the diagonal, which is the longest but probably not an interesting self-match.
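
The table fill described above could be sketched in Python roughly as follows. This is only an illustration under my own assumptions: the function names `self_match_table` and `longest_repeat` and the sample input `"BA0BAB"` are hypothetical, and the table is shifted by one so that row 0 and column 0 stand in for index -1.

```python
def self_match_table(a):
    """Return F, where F[i + 1][j + 1] is the length of the match ending at a[i] and a[j]."""
    n = len(a)
    # Row 0 and column 0 are the zero border that the text calls index -1.
    f = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            # Skip the diagonal: a position always matches itself.
            if i != j and a[i] == a[j]:
                f[i + 1][j + 1] = f[i][j] + 1
    return f


def longest_repeat(a):
    """Return one longest substring whose occurrences end at two different positions."""
    f = self_match_table(a)
    best_len, best_end = 0, 0
    for i in range(len(a)):
        for j in range(len(a)):
            if f[i + 1][j + 1] > best_len:
                best_len, best_end = f[i + 1][j + 1], i
    # The match of length best_len ends at index best_end (inclusive).
    return a[best_end - best_len + 1 : best_end + 1]


print(longest_repeat("BA0BAB"))  # BA
```

Note that the two occurrences are allowed to overlap in this sketch, because the table only compares end positions.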

Thinking about it more, this table is symmetric about the diagonal, so it is enough to compute only half of it. Also, the array is zero-initialized, so assigning zero is redundant. That leaves:

  // upper triangle only (j > i); cells that do not match keep their initial zero
  for i = 0 to N - 1
    for j = i + 1 to N - 1
      if A[i] == A[j]
        F[i,j] = F[i-1,j-1] + 1;

This is shorter but potentially more difficult to understand. The computed table contains all matches, short and long. You can add further filtering as needed.
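
As a rough sketch of the half-table variant with one possible filter, the snippet below keeps only matches of at least `min_len` characters. The name `repeated_ends` and the `min_len` parameter are my own assumptions, not part of the answer.

```python
def repeated_ends(a, min_len=2):
    """Return (length, i, j) for every upper-triangle cell holding a match of at least min_len."""
    n = len(a)
    # Row 0 and column 0 play the role of index -1 and stay zero.
    f = [[0] * (n + 1) for _ in range(n + 1)]
    hits = []
    for i in range(n):
        for j in range(i + 1, n):  # only j > i, so the diagonal is never touched
            if a[i] == a[j]:
                f[i + 1][j + 1] = f[i][j] + 1
                if f[i + 1][j + 1] >= min_len:
                    hits.append((f[i + 1][j + 1], i, j))
    return hits


print(repeated_ends("BA0BAB"))  # [(2, 1, 4)]
```

Each tuple says that a match of the given length ends at index `i` and again at index `j`.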

In the next step, you need to recover the strings by following the diagonal up and to the left from the non-zero cells. During this step it is also trivial to use a hash map to count the number of self-matches for the same string. With a typical string and a reasonable minimum length, only a small number of table cells will be processed through this map.
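
The recovery and counting step might look something like the sketch below, which uses `collections.Counter` as the hash map. The function name `count_self_matches`, the `min_len` parameter, and the sample text are assumptions for illustration. The counts are numbers of matching pairs of end positions, so a substring that occurs m times gets m(m-1)/2; ranking by this value gives the same order as ranking by occurrences.

```python
from collections import Counter


def count_self_matches(a, min_len=2):
    """Count, per substring, the pairs of end positions (i < j) at which it matches itself."""
    n = len(a)
    f = [[0] * (n + 1) for _ in range(n + 1)]  # row/column 0 stand in for index -1
    counts = Counter()
    for i in range(n):
        for j in range(i + 1, n):
            if a[i] == a[j]:
                k = f[i][j] + 1
                f[i + 1][j + 1] = k
                # Following the diagonal up-left from this cell, every suffix of
                # the match with length min_len .. k is itself a repeated string.
                for length in range(min_len, k + 1):
                    counts[a[i - length + 1 : i + 1]] += 1
    return counts


text = "hello world, this is hello world. hello world again"
for substring, pairs in count_self_matches(text, min_len=5).most_common(3):
    print(repr(substring), pairs)
```

For highly repetitive input the inner loop over lengths can dominate the running time, which is why a sensible minimum length matters here.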

I think that using a hash map directly actually requires O(n^3) time, since the key strings must be compared for equality on each access, and that comparison is probably O(n).
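
For comparison, the direct hash-map approach could be sketched as below: it counts every substring and lists them from most common to least common. The name `most_common_substrings` and the `min_len` / `max_len` parameters are hypothetical. There are O(n^2) substrings and each costs up to O(n) to slice and hash, which is consistent with the O(n^3) estimate above.

```python
from collections import Counter


def most_common_substrings(a, min_len=1, max_len=None):
    """Return (substring, occurrences) pairs sorted from most to least common."""
    n = len(a)
    max_len = n if max_len is None else min(max_len, n)
    counts = Counter(
        a[start : start + length]
        for length in range(min_len, max_len + 1)
        for start in range(n - length + 1)
    )
    # most_common() with no argument returns every entry, highest count first.
    return counts.most_common()


print(most_common_substrings("abab", min_len=2, max_len=2))  # [('ab', 2), ('ba', 1)]
```

Bounding `max_len` keeps both time and memory manageable on longer inputs.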