Find the 10 most frequently used words in a large book

Problem Description

I am aware that this has been asked on the forum a couple of times, but I did not find any tagged answer that could be considered the most appropriate solution, so I am asking again:

We are given a very large text from a book, all of which cannot fit into memory. We need to find the top 10 most frequently occurring words in the text. What would be the most optimal (time and space) way to do this?

My thoughts:

Divide the file into k sizeable chunks (such that each chunk can be stored in memory). Now, perform an external sort on each of the chunks. Once we have the N/k sorted files on disk (assuming N is the total size of the text from the book), I am not sure how I should continue so that I can obtain the top 10 elements from the k sorted arrays.
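For concreteness, a minimal Python sketch of that chunking-and-sorting step, assuming plain UTF-8 text, lower-cased words split on non-letter characters, and a hypothetical chunk_words threshold small enough to fit in memory:

```python
import re
import tempfile

def split_and_sort_chunks(path, chunk_words=1_000_000):
    """Split the book into chunks of roughly chunk_words words,
    sort each chunk in memory, and spill it to a temporary file."""
    chunk_files = []
    words = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            words.extend(re.findall(r"[a-z']+", line.lower()))
            if len(words) >= chunk_words:
                chunk_files.append(_flush_sorted(words))
                words = []
    if words:
        chunk_files.append(_flush_sorted(words))
    return chunk_files

def _flush_sorted(words):
    """Sort one chunk and write it to disk, one word per line."""
    tmp = tempfile.NamedTemporaryFile("w", delete=False,
                                      encoding="utf-8", suffix=".txt")
    for w in sorted(words):
        tmp.write(w + "\n")
    tmp.close()
    return tmp.name
```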

Also, if there is a different line of thought, please suggest.

Recommended Answer

Edit: There are problems with this algorithm, specifically that recursively merging lists makes this a polynomial-runtime algorithm. But I'll leave it here as an example of a flawed algorithm.

You cannot discard any words from your chunks because there may be one word that exists 100 times in only one chunk, and another that exists one time in each of 100 different chunks.

But you can still work with chunks, in a way similar to a MapReduce algorithm. You map each chunk to a word list (including count), then you reduce by recursively merging the word lists into one.
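As a rough illustration of the map step (detailed below), a minimal sketch that turns one in-memory chunk of words into an alphabetically sorted count file; the function name map_chunk and the '<word> <count>' line format are illustrative assumptions:

```python
from collections import Counter

def map_chunk(words, out_path):
    """Map step: count the words of one in-memory chunk and write
    '<word> <count>' lines to out_path in alphabetical order."""
    counts = Counter(words)
    with open(out_path, "w", encoding="utf-8") as out:
        for word in sorted(counts):
            out.write(f"{word} {counts[word]}\n")
```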

In the map step, map each word to a count for each chunk. Sort alphabetically, not by count, and store the lists to disk. Now you can merge the lists pairwise linearly without keeping more than two words in memory (a code sketch follows the steps below):

  1. Let A and B be the list files to merge, and R the result file.
  2. Read a line with a word + count from A; call the word a.
  3. Read a line with a word + count from B; call the word b.
  4. Compare the words alphabetically:
    • If a = b:
      • Sum their counts.
      • Write the word and the new count to R.
      • Go to 2.
    • If a > b (b comes first alphabetically):
      • Write b, including its count, to R.
      • Read a new line b from B.
      • Go to 4.
    • If a < b (a comes first alphabetically):
      • Write a, including its count, to R.
      • Read a new line a from A.
      • Go to 4.
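A minimal sketch of that pairwise merge, assuming each list file holds one '<word> <count>' pair per line in alphabetical order (the helper name merge_counts is an illustrative choice):

```python
def merge_counts(path_a, path_b, path_r):
    """Merge two alphabetically sorted '<word> <count>' files into one,
    summing the counts of words that appear in both."""
    with open(path_a, encoding="utf-8") as fa, \
         open(path_b, encoding="utf-8") as fb, \
         open(path_r, "w", encoding="utf-8") as fr:

        def read(f):
            line = f.readline()
            if not line:
                return None, 0
            word, count = line.split()
            return word, int(count)

        a, ca = read(fa)
        b, cb = read(fb)
        while a is not None and b is not None:
            if a == b:                 # same word: sum the counts
                fr.write(f"{a} {ca + cb}\n")
                a, ca = read(fa)
                b, cb = read(fb)
            elif a < b:                # a comes first alphabetically
                fr.write(f"{a} {ca}\n")
                a, ca = read(fa)
            else:                      # b comes first alphabetically
                fr.write(f"{b} {cb}\n")
                b, cb = read(fb)
        # one file is exhausted: copy the remainder of the other
        while a is not None:
            fr.write(f"{a} {ca}\n")
            a, ca = read(fa)
        while b is not None:
            fr.write(f"{b} {cb}\n")
            b, cb = read(fb)
```

Each merge holds only two (word, count) pairs in memory at a time; the cost lies in how often the same word gets rewritten as the merges cascade, which is the flaw noted in the edit above.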

Continue to do this pairwise merge until all files are merged into a single list. Then you can scan the result list once and keep the ten most frequent words.
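One possible way to do that final scan in Python, keeping a min-heap of ten (count, word) pairs so only ten entries are ever held in memory:

```python
import heapq

def top_ten(merged_path):
    """Scan the final '<word> <count>' list once, keeping only the
    ten largest counts in a min-heap of size 10."""
    heap = []  # (count, word) pairs; the smallest count sits at the root
    with open(merged_path, encoding="utf-8") as f:
        for line in f:
            word, count = line.split()
            item = (int(count), word)
            if len(heap) < 10:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)  # drop the current smallest
    return [word for count, word in sorted(heap, reverse=True)]
```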
