Find the 10 most frequently used words in a large book

Problem description

I am aware that this has been asked on the forum a couple of times, but I did not find any 'TAGGED' answer which could be considered the most appropriate solution, so I am asking again:

We are given a very large text from a book, all of which cannot fit into memory. We are required to find the top 10 most frequently occurring words in the text. What would be the most optimal (time and space) way to do this?

My idea:

Divide the file into k sizeable chunks (such that each chunk can be stored in memory). Now perform an external sort on each of the chunks. Once we have the k sorted files on disk (assuming N is the total size of the text, so each chunk is roughly N/k), I am not sure how I should continue so that I can obtain the top 10 elements from the k sorted arrays.

Also, if there is a different line of thought, please suggest.

Recommended answer

Edit: There are problems with this algorithm, specifically that recursively merging the lists makes it a polynomial-runtime algorithm. But I'll leave it here as an example of a flawed algorithm.

You cannot discard any words from your chunks, because there may be one word that occurs 100 times in only one chunk, and another that occurs once in each of 100 different chunks.

But you can still work with chunks, in a way similar to a MapReduce algorithm. You map each chunk to a word list (including counts), then you reduce by recursively merging the word lists into one.

In the map step, map each word to a count for each chunk. Sort each list alphabetically, not by count, and store the lists to disk.
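As a rough illustration of this map step, here is a minimal Python sketch; the function name map_chunk, the whitespace tokenisation, and the "word count" line format are my own assumptions, not part of the original answer:

```python
from collections import Counter

def map_chunk(chunk_text, out_path):
    """Count the words of one in-memory chunk and write them to out_path,
    one "word count" pair per line, sorted alphabetically (not by count)."""
    counts = Counter(chunk_text.lower().split())
    with open(out_path, "w", encoding="utf-8") as out:
        for word in sorted(counts):
            out.write(f"{word} {counts[word]}\n")
```

Each chunk file is then a small, alphabetically sorted frequency list that the merge step below can stream line by line.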

Now you can merge the lists pairwise linearly without keeping more than two words in memory:

  1. Let A and B be the list files to merge, and R the result file
  2. Read one line with word+count from A, call the word a
  3. Read one line with word+count from B, call the word b
  4. Compare the words alphabetically:
    • If a = b:
      • Sum their counts
      • Write the word and new count to R
      • Go to 2
    • If a > b:
      • Write b including its count to R
      • Read a new line b from B
      • Go to 4
    • If a < b:
      • Write a including its count to R
      • Read a new line a from A
      • Go to 4
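A minimal Python sketch of this pairwise merge, under the same assumed "word count" line format (merge_pair and read_pair are hypothetical names; the answer itself only gives the steps above):

```python
def merge_pair(path_a, path_b, path_r):
    """Merge two alphabetically sorted "word count" files into one,
    summing the counts of words that appear in both (steps 1-4 above)."""
    def read_pair(f):
        line = f.readline()
        if not line:
            return None, 0            # end of file
        word, count = line.split()
        return word, int(count)

    with open(path_a) as fa, open(path_b) as fb, open(path_r, "w") as fr:
        a, ca = read_pair(fa)
        b, cb = read_pair(fb)
        while a is not None and b is not None:
            if a == b:                # same word in both files: sum the counts
                fr.write(f"{a} {ca + cb}\n")
                a, ca = read_pair(fa)
                b, cb = read_pair(fb)
            elif a > b:               # b comes first alphabetically
                fr.write(f"{b} {cb}\n")
                b, cb = read_pair(fb)
            else:                     # a comes first alphabetically
                fr.write(f"{a} {ca}\n")
                a, ca = read_pair(fa)
        # one of the files may still have unread lines: copy them through
        while a is not None:
            fr.write(f"{a} {ca}\n")
            a, ca = read_pair(fa)
        while b is not None:
            fr.write(f"{b} {cb}\n")
            b, cb = read_pair(fb)
```

Because both inputs are already sorted alphabetically, only the current line of each file needs to be held in memory at any time.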

Continue this pairwise merging until all files are merged into a single list. Then you can scan the result list once and keep the ten most frequently occurring words.
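For that final scan, a small sketch using Python's heapq.nlargest, which keeps only ten entries in memory while streaming the merged file (again assuming the "word count" line format used above):

```python
import heapq

def top_ten(merged_path):
    """Stream the final merged "word count" file once and return the ten
    entries with the largest counts, without loading the file into memory."""
    def entries():
        with open(merged_path) as f:
            for line in f:
                word, count = line.split()
                yield int(count), word
    return heapq.nlargest(10, entries())
```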
