计算传入的字符流中单词的出现 [英] Count the occurrence of a word in an incoming stream of characters

查看:80
本文介绍了计算传入的字符流中单词的出现的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在采访中有人问我这个问题,虽然我在DS& Algo方面很出色,但是我无法解决这个问题。总之,这是一个有趣的问题。

I was asked this question during an interview and although I am good in DS&Algo but this one I was not able to solve. This is an interesting question anyways so posting it.

问题:您有一个字符输入流,需要计算一个单词的出现次数。您只能从流中读取一个API,即stream.next_char(),如果没有,则返回 \0。

Problem: You have an incoming stream of characters and you need to count the occurrences of a word. You have only one API to read from the stream which is stream.next_char() which return "\0" if there is none.

int count_occurrences(Stream stream, String word) {
// you have only one function provided from Stream class that you can use to 
// read one char at a time, no length/size etc.
// stream.next_char() - return "\0" if end
}


$返回 \0 b $ b

输入: aabckjhabcc
单词: abc
输出:2

Input: "aabckjhabcc" Word: "abc" Output: 2

推荐答案

我认为这个问题与
使用有限状态计算模型匹配字符串的想法有关。

I think this question is related to the idea of matching strings using a finite state model of computation.

可以通过使用KMP解决此问题字符串
匹配算法。

This question can be solved by using the KMP string matching algorithm.

KMP算法尝试通过考虑模式的前缀多少来找到模式
字符串的文本字符串中的匹配项即使在某个时候发现不匹配,
仍然会匹配。

The KMP algorithm tries to find matches in a text string of a pattern string, by considering how much prefix of pattern is still matched even if we find a mismatch at some point.

用于确定仍然可以匹配多少个前缀ed如果
在匹配模式中的索引i之前遇到不匹配,则
会预先构建一个故障函数。 (请参考下面的代码
来构建故障函数值)

For determining "how much prefix can still be matched" if we encounter a mismatch after matching upto index i in pattern, a failure function is built in advance. (Please refer the below code for building of failure function values)

此故障函数将为模式的每个索引i告诉
多少即使
在索引i之后遇到不匹配,该模式的前缀仍然可以匹配。

This failure function will tell for each index i of pattern, how much prefix of the pattern can still be matched even if we encounter a mismatch after index i.

这个想法是要弄清楚最长适当长度的长度是多少模式
的前缀,也是模式1到i
索引所表示的模式的每个子串的后缀,其中i的范围从1到n。

The idea is to figure out what is the length of longest proper prefix of pattern that is also a suffix of it for each substring of pattern denoted by 1 to i indices where i ranges from 1 to n.

我正在使用字符串索引从1开始。

I am using string indices to start from 1.

因此,任何模式
的第一个字符的失败函数值为0。(即,没有字符具有

Thus the failure function value for first character of any pattern is 0. (i.e. no character has been matched so far).

对于后续字符,对于每个索引i = 2到n,我们看到最长的$ b $的长度是
b模式[1 ... i]的子字符串的适当前缀,也是模式[1 ... i]的子字符串的
a后缀。

For subsequent characters, for each index i = 2 to n, we see what is the length of the longest proper prefix of the substring of pattern[1...i] which is also a suffix of that substring of pattern[1...i].

假设我们的模式是 aac,则失败函数va
索引1的lue为0(尚无匹配项),索引2的故障函数值
为1(最长适当前缀的长度与
最长适当后缀相同)对于 aa为1)

Suppose our pattern is "aac", then the failure function value for index 1 is 0 (no match yet), and the failure function value for index 2 is 1, ( the length of the longest proper prefix which is same as longest proper suffix for "aa" is 1)

对于模式 ababac,索引1的失败函数值为0,
索引2为0,索引3为1 (因为第三个索引'a'与
第一个索引'a'相同),对于索引4为2(因为在索引1和2处的 ab与在索引处的$ ab $ b与 ab相同) 3和4),并且对于索引5为3(索引[1 ... 3]处的 aba与索引[3 ... 5]处的 aba相同)。对于索引6,失败函数的值为0。

For pattern "ababac" the failure function value for index 1 is 0, index 2 is 0, index 3 is 1 (as the third index 'a' is the same as first index 'a'), for index 4 is 2 (as "ab" at indices 1 and 2 are same as "ab" at indices 3 and 4) , and for index 5 is 3 ("aba" at indices[1...3] is the same as "aba" at indices [3...5] ). For index 6, the failure function value is 0.

这是用于构建失败函数并使用以下命令匹配
a文本(或流)的代码(C ++)

Here is the code (C++) for building the failure function and matching a text (or stream) using it :

/* Assuming that string indices start from 1 for both pattern and text. */
/* Firstly build the failure function. */
int s = 1;
int t = 0;  

/* n denotes the length of the pattern */
int *f = new int[n+1];
f[1] = 0;   

for (s = 1; s < n; s++) {
    while (t > 0 && pattern[t + 1] != pattern[s + 1]) {
        t = f[t];
    }
    if (pattern[t + 1] == pattern[s + 1]) {
        t++;
        f[s + 1] = t;
    }
    else {
        f[s + 1] = 0;           
    }
}

/* Now start reading characters from the stream */
int count = 0;
char current_char = stream.next_char();

/* s denotes the index of pattern upto which we have found match in text */
/* initially its 0 i.e. no character has been matched yet. */
s = 0; 
while (current_char != '\0') {

    /* IF previously, we had matched upto a certain number of
       characters, and then failed, we return s to the point
       which is the longest prefix that still might be matched.

       (spaces between string are just for clarity)
       For e.g., if pattern is              "a  b  a  b  a  a" 
       & its failure returning index is     "0  0  1  2  3  1"

       and we encounter 
       the text like :      "x  y  z  a  b  a  b  a  b  a  a" 
              indices :      1  2  3  4  5  6  7  8  9  10 11

       after matching the substring "a  b  a  b  a", starting at
       index 4 of text, we are successful upto index 8  but we fail
       at index 9, the next character at index 9 of text is 'b'
       but in our pattern which should have been 'a'.Thus, the index
       in pattern which has been matched till now is 5 ( a  b  a  b  a)
                                                         1  2  3  4  5
       Now, we see that the failure returning index at index 5 of 
       pattern is 3, which means that the text is still matched upto
       index 3 of pattern (a  b  a), not from the initial starting 
       index 4 of text, but starting from index 6 of text.

       Thus we continue searching assuming that our next starting index
       in text is 6, eventually finding the match starting from index 6
       upto index 11.    

       */
        while (s > 0 && current_char != pattern[s + 1]) {
            s = f[s];
        }
        if (current_char == pattern[s + 1]) s++; /* We test the next character after the currently
                                                    matched portion of pattern with the current 
                                                    character of text , if it matches, we increase
                                                    the size of our matched portion by 1*/
        if (s == n) {
            count++;
        }
        current_char = stream.next_char();
}

printf("Count is %d\n", count);

`

注意:该方法将甚至在出现重叠模式时也有助于找出计数。例如,单词 aba在流 ababa中是两次出现

Note: This method will be helpful in finding out counts even in overlapping pattern occurrences. For example, the word "aba" as two occurrences in the stream "ababa".

这篇关于计算传入的字符流中单词的出现的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆