Parallel top ten algorithm for distributed data


Problem Description


This is an interview question. Suppose there are a few computers and each computer keeps a very large log file of visited URLs. Find the top ten most visited URLs.


For example: Suppose there are only 3 computers and we need the top two most visited URLs.


Computer A: url1, url2, url1, url3
Computer B: url4, url2, url1, url1
Computer C: url3, url4, url1, url3

url1 appears 5 times in all logs
url2 2
url3 3
url4 2 

So the answer is url1, url3


The log files are too large to fit in RAM and too large to copy over the network. As I understand it, it is also important to make the computation parallel and to use all of the given computers.

How would you solve this?

Answer


This is a pretty standard problem for which there is a well-known solution. You simply sort the log files on each computer by URL and then merge them through a priority queue of size K (the number of items you want) on the "master" computer. This technique has been around since the 1960s and is still in use today (although slightly modified) in the form of MapReduce.


On each computer, extract the URLs and their counts from the log file, and sort by URL. Because the log files are larger than will fit into memory, you need to do an on-disk merge sort. That entails reading a chunk of the log file, sorting it by URL, and writing the sorted chunk to disk; then reading the next chunk, sorting it, writing it to disk, and so on. At some point you have M log file chunks, each sorted. You can then do an M-way merge. But instead of writing items to disk, you present them, in sorted order (sorted by URL, that is), to the "master".
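A minimal sketch of this per-worker phase in Python (the chunk size, the temporary run files, and the log_lines iterator of URLs are illustrative assumptions, not part of the original answer):

import heapq
import itertools
import tempfile

def sorted_runs(log_lines, chunk_size=100_000):
    # Read the log in chunks that fit in memory, sort each chunk
    # by URL, and write it to disk as a sorted "run" file.
    run_files = []
    while True:
        chunk = list(itertools.islice(log_lines, chunk_size))
        if not chunk:
            break
        chunk.sort()
        run = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
        run.write("\n".join(chunk) + "\n")
        run.close()
        run_files.append(run.name)
    return run_files

def url_counts(run_files):
    # M-way merge of the sorted runs: heapq.merge streams the URLs
    # in sorted order, and groupby collapses repeats into counts.
    streams = [(line.rstrip("\n") for line in open(name)) for name in run_files]
    for url, group in itertools.groupby(heapq.merge(*streams)):
        yield url, sum(1 for _ in group)

The worker would stream these (url, count) pairs, still sorted by URL, to the master instead of writing them back to disk.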


Each machine sorts its own log.


The "master" computer merges the data from the separate computers and does the top K selection. This is actually two problems, but can be combined into one.


The master creates two priority queues: one for the merge, and one for the top K selection. The first is of size N, where N is the number of computers it's merging data from. The second is of size K: the number of items you want to select. I use a min heap for this, as it's easy and reasonably fast.


To set up the merge queue, initialize the queue with the first item from each of the "worker" computers. In the pseudo-code below, "get lowest item from merge queue" means getting the root item from the merge queue and then getting the next item from whichever worker computer presented that item. So if the queue contains [1, 2, 3], and the items came from computers B, C, A (in that order), then taking the lowest item would mean getting the next item from computer B and adding it to the priority queue.
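One way to express that merge queue with Python's heapq (a sketch assuming each worker stream yields (url, count) pairs sorted by URL, such as the hypothetical url_counts generator above; the worker index serves as a tie-breaker so counts are never compared):

import heapq

def merged_stream(workers):
    # Initialize the queue with the first item from each worker.
    # Heap entries are (url, worker_index, count), so the root is
    # always the lexicographically lowest URL.
    heap = []
    for i, worker in enumerate(workers):
        item = next(worker, None)
        if item is not None:
            heap.append((item[0], i, item[1]))
    heapq.heapify(heap)
    while heap:
        # Pop the lowest item, then refill the queue from the
        # worker that presented it, as described above.
        url, i, count = heapq.heappop(heap)
        yield url, count
        item = next(workers[i], None)
        if item is not None:
            heapq.heappush(heap, (item[0], i, item[1]))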

The master then does the following:

working = get lowest item from merge queue
while (items left to merge)
{
    temp = get lowest item from merge queue
    while (temp.url == working.url)
    {
        working.count += temp.count
        temp = get lowest item from merge queue
    }
    // Now have merged counts for one url.
    if (topK.Count < desired_count)
    {
        // topK queue doesn't have enough items yet.
        // so add this one.
        topK.Add(working);
    }
    else if (topK.Peek().count < working.count)
    {
        // the count for this url is larger
        // than the smallest item on the heap
        // replace smallest on the heap with this one
        topK.RemoveRoot()
        topK.Add(working)
    }
    working = temp;
}
// Here you need to check the last item:
if (topK.Count < desired_count)
{
    // topK queue doesn't have enough items yet,
    // so add the last one unconditionally.
    topK.Add(working)
}
else if (topK.Peek().count < working.count)
{
    // the count for this url is larger
    // than the smallest item on the heap
    // replace smallest on the heap with this one
    topK.RemoveRoot()
    topK.Add(working)
}
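Putting it together, a runnable Python version of that loop (it builds on the hypothetical merged_stream sketch above; itertools.groupby plays the role of the inner while loop that sums the counts for one URL, and heapq keeps the smallest count at the root of the topK heap, matching the min heap described earlier):

import heapq
import itertools

def top_k(merged, k):
    # merged yields (url, count) pairs sorted by URL; sum the
    # counts for each URL and keep the k largest totals.
    topk = []  # min-heap of (total, url); root is the smallest kept count
    for url, group in itertools.groupby(merged, key=lambda pair: pair[0]):
        total = sum(count for _, count in group)
        if len(topk) < k:
            heapq.heappush(topk, (total, url))
        elif topk[0][0] < total:
            heapq.heapreplace(topk, (total, url))  # evict the smallest
    return sorted(topk, reverse=True)  # largest count first

Running it on the question's example reproduces the expected answer:

a = iter([("url1", 2), ("url2", 1), ("url3", 1)])  # Computer A's counts
b = iter([("url1", 2), ("url2", 1), ("url4", 1)])  # Computer B's counts
c = iter([("url1", 1), ("url3", 2), ("url4", 1)])  # Computer C's counts
print(top_k(merged_stream([a, b, c]), 2))  # [(5, 'url1'), (3, 'url3')]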


At this point, the topK queue has the K items with the highest counts.


So each computer has to do a merge sort, which is O(n log n), where n is the number of items in that computer's log. The merge on the master is O(n), where n is the sum of all the items from the individual computers (each merge step costs O(log N) for N computers, but N is a small constant). Picking the top K items is O(n log K), where n is the number of unique URLs.


The sorts are done in parallel, of course, with each computer preparing its own sorted list. But the final "merge" pass of each computer's sort is done at the same time the master computer is merging, so there is some coordination, and all machines are involved at that stage.

