每分钟/小时/天前100字的Twitter实时跟踪 [英] Realtime tracking of top 100 twitter words per min/hour/day

查看:124
本文介绍了每分钟/小时/天前100字的Twitter实时跟踪的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我最近过这个面试问题就来了:

I recently came across this interview question:

Given a continuous twitter feed, design an algorithm to return the 100 most
frequent words used at this minute, this hour and this day.

我在想一个系统与的哈希映射字 - >计数链接到3分钟,堆当前分钟,小时,天。

I was thinking of a system with a hash map of word -> count linked to 3 min-heaps for the current min, hour and day.

每一个传入消息标记化,消毒和字数在哈希表更新(并增加键的堆,如果这个词已经存在的话)

Every incoming message is tokenized, sanitized and the word counts updated in the hash map (and increase-key in the heaps if the word already exists in it)

如果任何的字不堆存在(和堆大小== 100)检查,如果他们的频率>最小值在人堆里,如果是这样,然后提取分钟,并插入到堆。

If any of the words don't exist in the heap (and heap size == 100) check if their frequency > min value in the heap and if so then extract-min and insert into the heap.

是否有这样做的更好的办法?

Are there better ways of doing this?

推荐答案

您的算法是一个良好的开端,但它不会产生正确的结果。问题是,哈希表来描述他们的方式是单向的:一旦一个单词被添加,它永远保持计数

Your algorithm is a good start, but it is not going to produce correct results. The problem is that the hash tables the way you describe them are a one-way street: once a word gets added, it stays counted forever.

您需要的数组 1440 (24 * 60)字+计数哈希映射组织的方式,你描述;这些都是你分钟按一分钟计数。你需要两个额外的哈希映射 - 为滚动总小时和一天中的

You need an array of 1440 (24*60) word+count hash maps organized the way that you describe; these are your minute-by-minute counts. You need two additional hash maps - for the rolling total of the hour and the day.

定义两个操作上的哈希映射 - 添加减去,与语义合并相同的字计数,和去除的话,当他们的计数下降到零。

Define two operations on hash maps - add and subtract, with the semantic of merging counts of identical words, and removing words when their count drops to zero.

您开始一个新的哈希映射中的每分钟,并且从进料更新计数。在一分钟结束时,您将散列映射到阵列当前分钟,将其添加到滚动总的小时,一天,然后从小时运行总数减去一个小时前的哈希映射,并减去24小时前散列映射从日常运行总和。

Each minute you start a new hash map, and update counts from the feed. At the end of the minute, you place that hash map into the array for the current minute, add it to the rolling total for the hour and for the day, and then subtract the hash map of an hour ago from the hourly running total, and subtract the hash map of 24 hours ago from the daily running total.

最后,您需要一种方法来产生前100字赋予了哈希映射。这应该是一个简单的任务 - 将项目添加到字+计数项的数组,排序上的计数,并保持前100

Finally, you need a way to produce the top 100 words given a hash map. This should be a trivial task - add items to an array of word+count entries, sort on the count, and keep the top 100.

这篇关于每分钟/小时/天前100字的Twitter实时跟踪的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆