Design a system to keep top k frequent words real time

Problem Description

Suppose we want a system to keep the top k frequent words appearing in tweets over the last hour. How would we design it?

I can come up with a hashmap, a heap, a log, or MapReduce, but I cannot find a very efficient way to do this.

Actually, this is a question from an interview.
First I used a hashmap to count the frequency of each word. I also kept a log, so that as time passed I could decrement the counts of the oldest words.
Then I kept an entry array of length K (the top-K array) and a number N, which is the smallest count in the array.
Every time a new word comes in, I update the counting hashmap and get this word's count. If it is larger than N, I check whether the word is already in the array. If it is, I update that entry in the array. If not, I delete the smallest entry in the array and insert the new word. (And update N accordingly.)
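
For concreteness, here is a rough Python sketch of the scheme the asker describes. The names and details (NaiveTopK, the window length, and so on) are my assumptions, not from the question:

from collections import Counter, deque

class NaiveTopK:
    # Rough sketch of the asker's scheme; simplifications are mine.
    def __init__(self, k, window_seconds=3600):
        self.k = k
        self.window = window_seconds
        self.counts = Counter()   # word -> frequency inside the window
        self.log = deque()        # (timestamp, word), oldest first
        self.top = {}             # the "top-K array": word -> count

    def add(self, word, now):
        # expire the oldest words first, as the log passes out of the window
        while self.log and self.log[0][0] <= now - self.window:
            _, old = self.log.popleft()
            self.counts[old] -= 1
            if old in self.top:
                self.top[old] = self.counts[old]
                # PROBLEM: if `old` drops below a word outside `top`, finding
                # the replacement means scanning all of `counts` -- this is
                # exactly the weakness described below.
        self.log.append((now, word))
        self.counts[word] += 1
        c = self.counts[word]
        n = min(self.top.values(), default=0)   # N: smallest count in the top-K
        if word in self.top or len(self.top) < self.k:
            self.top[word] = c
        elif c > n:
            smallest = min(self.top, key=self.top.get)   # evict the minimum
            del self.top[smallest]
            self.top[word] = c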

Here is the problem: my approach cannot handle the deletions. I may need to iterate over the entire counting hashmap to find the new top K.
Also, as the interviewer said, the system should return results very fast. I thought of having several machines work together, with each machine taking some of the words. However, how to combine the results then becomes a problem too.

Answer

If the words are not weighted (other than weights 0 and 1), then it is possible to derive a simple datastructure which maintains the word counts in order, using O(N) auxiliary storage where N is the number of unique words encountered in the sliding window (one hour, in the example). All operations (add a word, expire a word, look up the most frequent word) can be performed in O(1) time. Since any accurate solution needs to retain all the unique words in the sliding window, this solution is not asymptotically worse although the constant factor per word is not small.

The key to the solution is that the count for any given word can only be incremented or decremented by 1, and that all of the counts are integers. Consequently, it is possible to maintain a doubly-linked list of counts (in order) where each node in the list points to a doubly-linked list of words which have that count. In addition, each node in the word-list points back to the appropriate count node. Finally, we maintain a hashmap which allows us to find the node corresponding to a given word.

Finally, in order to decay the words at the end of their life, we need to retain the entire datastream from the sliding window, which has a size of O(N') where N' is the total number of words encountered during the sliding window. This can be stored as a singly-linked list where each node has a timestamp and a pointer to the unique word in the word-list.

When a word is encountered or expired, its count needs to be adjusted. Since the count can only be incremented or decremented by 1, the adjustment always consists in moving the word to the adjacent count-node (which may or may not exist); since the count-nodes are stored in a sorted linked list, the adjacent node can be found or created in time O(1). Furthermore, the most popular words (and counts) can always be traced in constant time by traversing the count list backwards from the maximum.
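
Here is a minimal Python sketch of that structure; the answer itself gives no code, so all names (CountNode, TopKSlidingWindow, and so on) are illustrative. add() records one occurrence of a word and also retires anything that has fallen out of the window:

import time
from collections import deque

class CountNode:
    # One node per distinct count, holding every word that currently has it.
    __slots__ = ("count", "words", "prev", "next")
    def __init__(self, count):
        self.count = count
        self.words = set()     # the "word list" for this count
        self.prev = None       # neighbouring count nodes, kept sorted ascending
        self.next = None

class TopKSlidingWindow:
    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.node_of = {}                    # hashmap: word -> its CountNode
        self.head = CountNode(0)             # sentinel below every real count
        self.tail = CountNode(float("inf"))  # sentinel above every real count
        self.head.next, self.tail.prev = self.tail, self.head
        self.log = deque()                   # (timestamp, word), oldest first

    def _insert_after(self, node, count):
        new = CountNode(count)
        new.prev, new.next = node, node.next
        node.next.prev = new
        node.next = new
        return new

    def _drop_if_empty(self, node):
        if not node.words and node is not self.head and node is not self.tail:
            node.prev.next = node.next
            node.next.prev = node.prev

    def _move(self, word, src, delta):
        # Shift `word` to the neighbouring count node; always an O(1) step
        # because counts only ever change by 1.
        target = src.count + delta
        if target == 0:
            del self.node_of[word]           # count hit 0: drop the word entirely
        else:
            if delta == 1:
                dest = src.next if src.next.count == target \
                       else self._insert_after(src, target)
            else:
                dest = src.prev if src.prev.count == target \
                       else self._insert_after(src.prev, target)
            dest.words.add(word)
            self.node_of[word] = dest
        src.words.discard(word)
        self._drop_if_empty(src)

    def add(self, word, now=None):
        now = time.time() if now is None else now
        self.log.append((now, word))
        src = self.node_of.get(word)
        if src is None:
            # brand-new word: it belongs in the count-1 node, created if absent
            dest = self.head.next if self.head.next.count == 1 \
                   else self._insert_after(self.head, 1)
            dest.words.add(word)
            self.node_of[word] = dest
        else:
            self._move(word, src, +1)
        self._expire(now)

    def _expire(self, now):
        while self.log and self.log[0][0] <= now - self.window:
            _, old = self.log.popleft()
            if old in self.node_of:          # guard: tolerates early pruning
                self._move(old, self.node_of[old], -1)

    def top(self, k):
        # Most frequent first: walk the count list backwards from the tail.
        out, node = [], self.tail.prev
        while node is not self.head and len(out) < k:
            for w in node.words:
                if len(out) == k:
                    break
                out.append((w, node.count))
            node = node.prev
        return out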

In case that was not obvious, here is a rough ASCII-art drawing of the datastructure at a given point in time:

Count list      word lists (each node points back to the count node)

  17            a <--> the <--> for
  ^
  |
  v
  12            Wilbur <--> drawing
  ^
  |
  v
  11            feature

Now, suppose we find a Wilbur. That will raise its count to 13; we can see from the fact that the successor of 12 is not 13 that the 13 count node needs to be created and inserted into the count-list. After we do that, we remove Wilbur from its current word-list, put it into the newly-created empty word-list associated with the new count node, and change the count-pointer in Wilbur to point to the new count node.

Then, suppose that a use of drawing expires, so its new count will be 11. We can see from the fact that the predecessor of 12 is 11 that no new count node needs to be created; we simply remove drawing from its word-list and attach it to the word-list associated with 11, fixing its count-pointer as we do so. Now we notice that the word-list associated with 12 is empty, so we can remove the 12 count node from the count-list and delete it.

When the count for a word reaches 0, rather than attaching it to the 0 count node, which doesn't exist, we just delete the word node. And if a new word is encountered, we just add the word to the 1 count node, creating that count node if it doesn't exist.
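
Using the TopKSlidingWindow sketch from earlier, the walk-through above plays out like this (the frequencies come from the drawing; the timestamps are invented):

win = TopKSlidingWindow(window_seconds=3600)
t = 0.0
for word, n in [("a", 17), ("the", 17), ("for", 17),
                ("Wilbur", 12), ("drawing", 12), ("feature", 11)]:
    for _ in range(n):
        win.add(word, now=t)
        t += 0.001

win.add("Wilbur", now=t)   # 12 -> 13: a new 13 count node is created and inserted
print(win.top(4))          # e.g. [('a', 17), ('for', 17), ('the', 17), ('Wilbur', 13)]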

In the worst case, every word has a unique count, so the size of the count-list cannot be greater than the number of unique words. Also, the total size of the word-lists is exactly the number of unique words because every word is in exactly one word-list, and fully-expired words don't appear in the word-lists at all.

--- Edit

This algorithm is a bit RAM-hungry, but it really shouldn't have any trouble holding an hour's worth of tweets, or even a day's worth. And the number of unique words is not going to change much after a few days, even considering abbreviations and misspellings. Even so, it's worth thinking about ways to reduce the memory footprint and/or make the algorithm parallel.

To reduce the memory footprint, the easiest thing is to just drop words whose count is still 1 after a few minutes. This will dramatically cut down on the unique-word count without altering the counts of popular words. Indeed, you could prune a lot more drastically without altering the final result.
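
One possible way to realize that pruning on top of the sketch above; prune_singletons is a hypothetical helper of my own, not something from the answer, and each run pays an O(N') scan of the log so that later expiry stays consistent:

from collections import deque

def prune_singletons(win, now, grace=300):
    # Drop words whose count is still 1 after `grace` seconds, shrinking both
    # the hashmap and the count-1 word list, then rebuild the log without
    # their entries (each such word has exactly one live entry).
    ones = win.head.next
    if ones.count != 1:
        return                               # no singleton words at the moment
    stale = {w for ts, w in win.log
             if now - ts > grace and win.node_of.get(w) is ones}
    for w in stale:
        ones.words.discard(w)
        del win.node_of[w]
    win.log = deque(e for e in win.log if e[1] not in stale)
    win._drop_if_empty(ones)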

To run the algorithm in parallel, individual words can be allocated to different machines by using a hash function to generate a machine number. (Not the same hash function as the one used to construct the hash tables.) Then the top k words can be found by merging the top k words from each machine; the allocation by hash guarantees that the set of words from each machine is distinct.
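
A sketch of that merge step, under the assumption above that words are partitioned by hash: each word's full count then lives on exactly one machine, so the global top k is simply the k largest entries among the per-machine top-k lists. The function names are mine, and Python's built-in hash is only a stand-in here (it is randomized per process, so a real system would use a stable hash):

import heapq

def machine_for(word, num_machines):
    return hash(word) % num_machines   # stand-in for a stable hash function

def merge_top_k(per_machine_results, k):
    # per_machine_results: one [(word, count), ...] list per machine
    candidates = (pair for result in per_machine_results for pair in result)
    return heapq.nlargest(k, candidates, key=lambda p: p[1])

# e.g. merge_top_k([[("the", 17), ("a", 17)], [("Wilbur", 13)]], 2)
#      -> [("the", 17), ("a", 17)]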
