Find the 'n' most repeating words/strings in a huge file that does not fit in memory


Problem Description

I would like to verify my pseudocode, and I would appreciate suggestions for optimizations and better approaches.

mostRepeatingWords(int rank):
(rank here defines how many ranks you want to choose, i.e. top X most repeating words)

  1. External sort all the files alphabetically, following the algorithm mentioned here.
    Complexity: O(n log n / m)
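For concreteness, here is a minimal Python sketch of that external sort step, assuming whitespace-separated words and an in-memory chunk limit (the chunk size and helper name are illustrative, not part of the original question): fixed-size chunks are sorted in memory, written to temporary files as sorted runs, and then merged lazily with heapq.merge.

import heapq
import tempfile
from itertools import islice

def external_sort_words(words, chunk_size=1_000_000):
    """Sort an arbitrarily large stream of words using bounded memory.

    Returns an iterator over the words in sorted order. chunk_size is an
    assumed in-memory limit, not a value taken from the question.
    """
    runs = []
    words = iter(words)
    while True:
        chunk = list(islice(words, chunk_size))    # read one in-memory chunk
        if not chunk:
            break
        chunk.sort()                               # sort the chunk in memory
        tmp = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
        tmp.write("\n".join(chunk) + "\n")
        tmp.seek(0)
        runs.append(tmp)                           # one sorted run per chunk
    # Lazily merge the sorted runs; only one line per run is held in memory.
    return heapq.merge(*((line.rstrip("\n") for line in run) for run in runs))

The merge step keeps only one line per temporary file in memory, which is what makes the approach workable for input larger than RAM.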

Create a new file "wordCount" which would contain entries like

"word1" : 3-即word1重复了3次
"word2" : 1-即word2是唯一的.

"word1" : 3 - i.e. word1 was repeated three times
"word2" : 1 - i.e. word2 was unique.

 

for (read each of the sorted files one by one) {
    for (read each "word" in the current file) {
        int count = checkIfDuplicateAndReturnCount();
        enter "word" and "count" in the "wordCount" file;

        sanitizeFile();

        if (wordCount file is > page size) {
            write wordCount file to disk;
            create a new wordCount file;
        }

        move to the next word, i.e. currentWord + count;
    }
}
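Because the input to this pass is already sorted, equal words are adjacent, so the counting can be done in a single streaming pass. A hedged Python sketch of the idea (itertools.groupby stands in for checkIfDuplicateAndReturnCount, and the "word,count" output format is an assumption):

from itertools import groupby

def write_word_counts(sorted_words, out_file):
    """Collapse an already-sorted stream of words into "word,count" lines.

    Equal words are adjacent in the sorted input, so one pass with groupby
    is enough and only the current group is held in memory.
    """
    for word, group in groupby(sorted_words):
        out_file.write(f"{word},{sum(1 for _ in group)}\n")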

Complexity: O(n + p), where n is the number of sorted pages and p <= n is the number of "wordCount" pages.

checkIfDuplicateAndReturnCount() is a simple function which compares the current element with the first, the previous, and so on, and determines the frequency of the word.

sanitizeFile() is used when pages are flooded with the same word. Let's say the size of a page is 4 KB, and that after sorting, the number of pages containing a common word like "the" is greater than (4 KB / size of the word). Then we may need to create a new file, and sanitizing the file takes an extra step to combine two entries for the same word (a short sketch of this combining step follows the example). Example:

  • the:500
  • the:600

would be combined into:

  • the:1100
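A minimal sketch of that combining step, assuming the entries arrive as (word, count) pairs already sorted by word (the function name and pair format are illustrative):

from itertools import groupby
from operator import itemgetter

def combine_adjacent_counts(entries):
    """Merge adjacent (word, count) entries that refer to the same word.

    Example: [("the", 500), ("the", 600)] -> [("the", 1100)].
    """
    for word, group in groupby(entries, key=itemgetter(0)):
        yield word, sum(count for _, count in group)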

 

for (read all the wordCount files one by one) {
    for (each word in the wordCount file) {
        // use a min-heap / priority queue of size "rank" to keep track of the top frequencies
        if (wordcount > heap.peek()) {
            heap.addAtPositionZero(word);
        }
    }
}

Complexity: O(p)

Total complexity: O(n log n / m) + O(n + p) + O(p), effectively O(n log n / m)

Recommended Answer

There's no need to sort the files ahead of time. Doing so is just a bunch of unnecessary I/O.

A simpler approach is:

Create an empty dictionary (hash map), keyed by word. The value is the count.
for each file
    while not end of file
        read word
        if word in dictionary
            update count
        else
            if dictionary full
                sort dictionary by word
                output dictionary to temporary file
                Clear dictionary
            Add word to dictionary, with count 1
    end
end
if dictionary not empty
    sort dictionary by word
    output dictionary to temporary file
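As a rough illustration of this pass in Python (the maximum dictionary size and the "word,count" line format of the temporary files are assumptions chosen for the sketch, not something the answer prescribes):

import tempfile
from collections import Counter

def build_sorted_runs(paths, max_words=1_000_000):
    """Count words with an in-memory dictionary, spilling sorted runs to disk.

    Returns a list of open temporary files, each holding "word,count" lines
    sorted by word. max_words is an assumed memory limit.
    """
    counts = Counter()
    runs = []

    def spill():
        tmp = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
        for word in sorted(counts):
            tmp.write(f"{word},{counts[word]}\n")
        tmp.seek(0)
        runs.append(tmp)
        counts.clear()

    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                for word in line.split():
                    if word not in counts and len(counts) >= max_words:
                        spill()              # dictionary full: flush a sorted run
                    counts[word] += 1
    if counts:
        spill()                              # flush whatever is left
    return runs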

You now have some number of temporary files, each sorted by word and containing one word/count pair per line. Like:

aardvark,12
bozo,3
zebra,5

Create a min-heap that you will use to hold your n largest items. Call it largest_items.

Do a standard n-way merge of those temporary files. As you find each unique item (i.e. you merge all of the "aardvark" entries across the multiple files), you do this:

if (largest_items.count < n)
    largest_items.add(word)
else if (word.count > largest_items.peek().count)
{
    // the count for this word is more than the smallest count
    // already on the heap. So remove the item with the
    // smallest count, and add this one.
    largest_items.remove_root()
    largest_items.add(word)
}
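Putting the merge and the selection together, a hedged Python sketch (it reuses the "word,count" run files from the sketch above; heapq.merge performs the n-way merge, groupby collapses each unique word, and a min-heap of size n keeps the current best candidates):

import heapq
from itertools import groupby
from operator import itemgetter

def top_n_words(run_files, n):
    """Merge sorted "word,count" runs and return the n most frequent words.

    run_files are open file objects, each sorted by word with one
    "word,count" line per row. Returns (count, word) pairs, largest first.
    """
    def parse(f):
        for line in f:
            word, count = line.rstrip("\n").rsplit(",", 1)
            yield word, int(count)

    # n-way merge of the sorted runs, still ordered by word.
    merged = heapq.merge(*(parse(f) for f in run_files), key=itemgetter(0))

    largest_items = []                        # min-heap of (count, word)
    for word, group in groupby(merged, key=itemgetter(0)):
        total = sum(count for _, count in group)   # combine counts across runs
        if len(largest_items) < n:
            heapq.heappush(largest_items, (total, word))
        elif total > largest_items[0][0]:     # beats the smallest kept count
            heapq.heapreplace(largest_items, (total, word))
    return sorted(largest_items, reverse=True)

With the two sketches combined, top_n_words(build_sorted_runs(["big1.txt", "big2.txt"]), 10) would return the ten most frequent words (the file names here are hypothetical).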

Complexity:

  • Building the dictionaries is O(N), where N is the total number of individual words in the file.
  • Sorting each temporary dictionary is O(k log k), where 'k' is the number of words in the dictionary.
  • Writing each temporary dictionary is O(k).
  • The merge is O(M log x), where M is the combined number of entries across all the temporary files, and x is the number of temporary files.
  • Selecting the items is O(m log n), where m is the number of unique words, and n is the number of words you want to select.

If you look at worst case behavior (i.e. all the words are unique), the complexity works out to (n is the total number of words):

  • Building the dictionaries is O(n).
  • Sorting and writing the temporary files is (n/m) * (m log m), where m is the dictionary size.
  • The merge is n log (n/m).
  • Selection is O(m + (k log k)), where k is the number of words you want to select and m is the number of unique words. Because all words are unique they have the same count, so you'll only do k inserts into the heap. The m term dominates when k is much smaller than m (which is usually the case in these situations). So selection turns out to be O(m).

When you're working with data sets larger than memory, very often your bottleneck is file I/O. The algorithm I've outlined above tries to minimize the I/O. In the worst case (all words are unique), each word will be read twice and written once. But in the general case each word is read once and then each hash page is written once and read once. Plus, your sorts are on hash pages rather than raw words, and the total size of your temporary files will be much smaller than the original text.

