在大词序列中查找前 K 个频繁词的最有效方法 [英] The Most Efficient Way To Find Top K Frequent Words In A Big Word Sequence

查看:22
本文介绍了在大词序列中查找前 K 个频繁词的最有效方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

输入:一个正整数 K 和一个大文本.文本实际上可以被视为单词序列.这样我们就不用担心如何分解成词序了.
输出:文本中出现频率最高的 K 个词.

Input: A positive integer K and a big text. The text can actually be viewed as word sequence. So we don't have to worry about how to break down it into word sequence.
Output: The most frequent K words in the text.

我的想法是这样的.

  1. 在遍历整个词序列的同时,使用哈希表记录所有词的出现频率.在这个阶段,key是word",value是word-frequency".这需要 O(n) 时间.

  1. use a Hash table to record all words' frequency while traverse the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.

对(词,词频)对进行排序;关键是词频".对于普通排序算法,这需要 O(n*lg(n)) 时间.

sort the (word, word-frequency) pair; and the key is "word-frequency". This takes O(n*lg(n)) time with normal sorting algorithm.

排序后,我们只取前K个单词.这需要 O(K) 时间.

After sorting, we just take the first K words. This takes O(K) time.

总结一下,总时间是O(n+nlg(n)+K),因为K肯定比N小,所以实际上是O(nlg(n)).

To summarize, the total time is O(n+nlg(n)+K), Since K is surely smaller than N, so it is actually O(nlg(n)).

我们可以改进这一点.实际上,我们只想要top K词.其他词的频率与我们无关.所以,我们可以使用部分堆排序".对于步骤 2) 和 3),我们不只是进行排序.相反,我们将其更改为

We can improve this. Actually, we just want top K words. Other words' frequency is not concern for us. So, we can use "partial Heap sorting". For step 2) and 3), we don't just do sorting. Instead, we change it to be

2') 以词频"为关键字构建(词,词频)对的堆.构建一个堆需要 O(n) 时间;

2') build a heap of (word, word-frequency) pair with "word-frequency" as key. It takes O(n) time to build a heap;

3') 从堆中提取前 K 个单词.每次提取都是 O(lg(n)).所以,总时间是 O(k*lg(n)).

3') extract top K words from the heap. Each extraction is O(lg(n)). So, total time is O(k*lg(n)).

总而言之,这个解决方案花费时间 O(n+k*lg(n)).

To summarize, this solution cost time O(n+k*lg(n)).

这只是我的想法.我还没有找到改进步骤 1 的方法).
我希望一些信息检索专家可以对这个问题有更多的了解.

This is just my thought. I haven't find out way to improve step 1).
I Hope some Information Retrieval experts can shed more light on this question.

推荐答案

这可以在 O(n) 时间内完成

This can be done in O(n) time

解决方案 1:

步骤:

  1. 计算单词并对其进行哈希处理,最终会得到这样的结构

  1. Count words and hash it, which will end up in the structure like this

var hash = {
  "I" : 13,
  "like" : 3,
  "meow" : 3,
  "geek" : 3,
  "burger" : 2,
  "cat" : 1,
  "foo" : 100,
  ...
  ...

  • 遍历哈希并找到最常用的单词(在本例中为foo"100),然后创建该大小的数组

  • Traverse through the hash and find the most frequently used word (in this case "foo" 100), then create the array of that size

    然后我们可以再次遍历hash,并使用单词出现的次数作为数组索引,如果索引中没有任何内容,则创建一个数组,否则将其追加到数组中.然后我们得到一个像这样的数组:

    Then we can traverse the hash again and use the number of occurrences of words as array index, if there is nothing in the index, create an array else append it in the array. Then we end up with an array like:

      0   1      2            3                  100
    [[ ],[cat],[burger],[like, meow, geek],[]...[foo]]
    

  • 然后从末尾开始遍历数组,收集k个单词.

  • Then just traverse the array from the end, and collect the k words.

    解决方案 2:

    步骤:

    1. 同上
    2. 使用最小堆并将最小堆的大小保持为 k,对于散列中的每个单词,我们将单词的出现次数与最小值进行比较,1) 如果它大于最小值,则删除最小值(如果大小的最小堆等于 k) 并将数字插入最小堆.2) 休息简单的条件.
    3. 遍历数组后,我们只需将最小堆转换为数组并返回数组.

    这篇关于在大词序列中查找前 K 个频繁词的最有效方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

  • 查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆