字符串出现计数算法 [英] String Occurrence Counting Algorithm

查看:87
本文介绍了字符串出现计数算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我很好奇什么是最有效的算法(或最常用的算法)来计算一个文本块中字符串出现的次数。

I am curious what is the most efficient algorithm (or commonly used) to count the number of occurrences of a string in a chunk of text.

阅读,Boyer–Moore字符串搜索算法是字符串搜索的标准,但我不确定

From what I read, the Boyer–Moore string search algorithm is the standard for string searches but I am not sure if counting occurrences in an efficient way would be same as searching a string.

在Python中,这就是我想要的:

In Python this is what I want:

text_chunck = "one two three four one five six one"
occurance_count(text_chunck, "one") # gives 3.

编辑:好像python str.count 用作这种方法;但是,我无法找到它使用的算法。

It seems like python str.count serves as such a method; however, I am not able to find what algorithm it uses.

推荐答案

对于初学者来说,是的,您可以使用Boyer-摩尔非常有效。但是,根据您问题的其他一些参数,可能会有更好的解决方案。

For starters, yes, you can accomplish this with Boyer-Moore very efficiently. However, depending on some other parameters of your problem, there might be a better solution.

Aho-Corasick字符串匹配算法 将在以下位置找到所有出现的 set 模式字符串目标字符串,并在时间O(m + n + z)中执行此操作,其中m是要搜索的字符串的长度,n是要匹配的所有模式的组合长度,z是所产生的匹配总数。如果只有一个字符串要匹配,则源字符串和目标字符串的大小是线性的。它还会发现相同字符串的重叠出现。此外,如果要检查一组字符串在某个源字符串中出现多少次,则只需调用一次该算法即可。最重要的是,如果要搜索的字符串集永不更改,则可以将O(n)用作预处理时间,然后在O(m + z)中查找所有匹配项。

The Aho-Corasick string matching algorithm will find all occurrences of a set of pattern strings in a target string and does so in time O(m + n + z), where m is the length of the string to search, n is the combined length of all the patterns to match, and z is the total number of matches produced. This is linear in the size of the source and target strings if you just have one string to match. It also will find overlapping occurrences of the same string. Moreover, if you want to check how many times a set of strings appears in some source string, you only need to make one call to the algorithm. On top of this, if the set of strings that you want to search for never changes, you can do the O(n) work as preprocessing time and then find all matches in O(m + z).

另一方面,如果您要搜索一个源字符串和一组快速变化的子字符串,则可能需要使用 后缀树 。使用要搜索的字符串的O(m)预处理时间,可以以每个子字符串O(n)的时间检查字符串中长度为n的特定子字符串出现了多少次。

If, on the other hand, you have one source string and a rapidly-changing set of substrings to search for, you may want to use a suffix tree. With O(m) preprocessing time on the string that you will be searching in, you can, in O(n) time per substring, check how many times a particular substring of length n appears in the string.

最后,如果您正在寻找可以轻松编写代码且麻烦最少的东西,则可能需要考虑研究 Rabin-Karp 算法,该算法使用旋转哈希函数来查找字符串。可以用大约十到十五行代码来编码,没有预处理时间,对于普通的文本字符串(很多匹配项很少的文本)可以很快找到所有匹配项。

Finally, if you're looking for something you can code up easily and with minimal hassle, you might want to consider looking into the Rabin-Karp algorithm, which uses a roling hash function to find strings. This can be coded up in roughly ten to fifteen lines of code, has no preprocessing time, and for normal text strings (lots of text with few matches) can find all matches very quickly.

希望这会有所帮助!

这篇关于字符串出现计数算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆