如何计算流中的唯一单词? [英] How to count unique words in a stream?
本文介绍了如何计算流中的唯一单词?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
是否有一种方法可以使用Flink Streaming计算流中唯一单词的数量?结果将是不断增长的数字流.
Is there a way to count the number of unique words in a stream with Flink Streaming? The results would be a stream of number which keeps increasing.
推荐答案
您可以通过存储所有已经看到的单词来解决该问题.有了这些知识,您就可以过滤掉所有重复的单词.剩下的然后可以由并行度为1
的映射运算符计算.以下代码段正是这样做的.
You can solve the problem by storing all words which you've already seen. Having this knowledge you can filter out all duplicate words. The rest can then be counted by a map operator with parallelism 1
. The following code snippet does exactly that.
val env = StreamExecutionEnvironment.getExecutionEnvironment
val inputStream = env.fromElements("foo", "bar", "foobar", "bar", "barfoo", "foobar", "foo", "fo")
// filter words out which we have already seen
val uniqueWords = inputStream.keyBy(x => x).filterWithState{
(word, seenWordsState: Option[Set[String]]) => seenWordsState match {
case None => (true, Some(HashSet(word)))
case Some(seenWords) => (!seenWords.contains(word), Some(seenWords + word))
}
}
// count the number of incoming (first seen) words
val numberUniqueWords = uniqueWords.keyBy(x => 0).mapWithState{
(word, counterState: Option[Int]) =>
counterState match {
case None => (1, Some(1))
case Some(counter) => (counter + 1, Some(counter + 1))
}
}.setParallelism(1)
numberUniqueWords.print();
env.execute()
这篇关于如何计算流中的唯一单词?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
查看全文