将键值对划分为相等的列表,而无需访问键值计数 [英] Divide key value pairs into equal lists without access to key value counts

查看:123
本文介绍了将键值对划分为相等的列表,而无需访问键值计数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的问题简短地说:如果不知道任何键包含的值的数量,而又不能容纳所有键,是否可以使用一种算法将键值对划分为大致相等的长度列表(或它们的值计数)同时在RAM中存储?

My question briefly stated: Is there an algorithm one can use to divide key value pairs into roughly equal length lists if one doesn't know apriori the number of values that any key contains, and one can't hold all keys (or counts of their values) in RAM concurrently?

关于上下文的问题:我有多个包含键/值对的文件,其中键是散列,值是其中发生给定哈希的对象ID的列表.相同的密钥在每个文件中出现0或1次,并且给定的密钥经常出现在许多文件中.

My question with context: I have multiple files that contain key/value pairs, where keys are hashes and values are lists of object ids in which the given hash occurs. The same key appears zero or one times in each of these files, and frequently a given key appears in many of the files.

我正在将这些文件读入计算集群中运行的多个工作程序中.每个工作人员都被分配了密钥的一个子集.对于分配给工作人员的每个密钥,该工作人员都会累积出现在前面提到的任何密钥/值文件中的密钥的所有值.然后,每个工作人员都读取所有前面提到的文件,找到每个键的所有值,然后将单个输出文件写入磁盘.

I am reading those files into several workers running in a compute cluster. Each worker is assigned a subset of the keys. For each key a worker is assigned, the worker accumulates all of the values for the key that occur in any of the previously mentioned key/value files. Each worker then reads all of the previously-mentioned files, finds all values for each of its keys, and writes a single output file to disk.

我面临的问题是,工作人员在分配的键之间累积了数量迥然不同的值,因此他们的RAM要求完全不同(从低端的33GB到高端的139GB).现在,要为工作人员分配密钥,我对每个密钥进行了sha1哈希处理,如果sha1(key) % total_number_of_workers == worker_id(其中工作人员id是给定工作人员在所有工作人员中的索引位置),则会为工作人员分配给定密钥.

The trouble I'm facing is that the workers are accumulating wildly different numbers of values among their assigned keys, so their RAM requirements are quite different (from 33GB on the low end to 139GB on the high). Right now, to assign keys to workers, I take a sha1 hash of each key, and if sha1(key) % total_number_of_workers == worker_id (where worker id is a given worker's index position among all workers) then the worker is assigned the given key.

是否有一种方法可以为工作人员分配密钥,这将有助于确保在节点之间更均匀地分配RAM要求?其他人对此问题的任何建议将不胜感激!

Is there a way to assign keys to workers that will help ensure a more equal distribution of RAM requirements among the nodes? Any advice others can offer on this question would be greatly appreciated!

如果其他人可能对它感兴趣,我整理了一个吉米·米歇尔(Kim Mischel)在Python下面描述的k路合并的简单实现[

In case it might be of interest to others, I put together a simple implementation of a k-way merge that Jim Mischel describes below in Python [gist]. This implementation doesn't require one to have all text files in memory concurrently, which may be impossible for large datasets.

推荐答案

这是一个简单的 k-方式合并.假设您有三个文件:

It's a simple k-way merge. Let's say you have three files:

File 1     File 2     File 3
A=3        B=7        C=22
X=9        B=4        D=19
Q=33       Z=26       A=2
X=47       X=12       D=13

现在,您对这些文件进行排序:

Now, you sort those files:

Sorted1    Sorted2    Sorted3
A=3        B=7        A=2
Q=33       B=4        C=22
X=9        X=12       D=19
X=47       Z=26       D=13

您可以执行合并步骤并最终得到一个文件:

You could do a merge step and end up with a single file:

A=3
A=2
B=7
B=4
C=22
D=19
D=13
Q=33
X=9
X=47
X=12
Z=26

然后浏览该文件,累积并写入值.

And then scan through that file, accumulating and writing values.

但是您可以在一个步骤中进行合并和累积.毕竟,当您进行合并时,您将按已排序的键顺序输出内容,因此您所要做的就是在输出步骤之前插入累积代码.

But you can do the merge and accumulation in a single step. After all, when you do the merge you're outputting things in sorted key order, so all you have to do is insert the accumulation code before the output step.

单个进程启动并创建一个优先级队列,该优先级队列包含每个文件中的第一项.因此,优先级队列将包含[A=3, B=7, A=2].该程序从优先级队列中获取最小的密钥A = 3,并使用第一个排序文件中的下一个项目刷新该队列.队列现在包含[Q=33,B=7,A=2].

A single process starts up and creates a priority queue that contains the first item from each file. So the priority queue would contain [A=3, B=7, A=2]. The program takes the smallest key, A=3, from the priority queue, and the queue is refreshed with the next item from the first sorted file. The queue now contains [Q=33,B=7,A=2].

程序使用键A创建一个新数组,其中包含值[3].然后,它再次进入队列并读取最小值:A = 2.它看到密钥等于它正在处理的密钥,因此它将数组更新为[3,2].队列从已排序的文件刷新,因此现在包含[Q=33,B=7,C=22].

The program creates a new array with key A, containing the value [3]. Then it goes to the queue again and reads the smallest value: A=2. It sees that the key is equal to the one it's working on, so it updates the array to [3,2]. The queue is refreshed from the sorted file, so now it contains [Q=33,B=7,C=22].

再次,程序从队列中获取最小的键值.这次是B.B不等于A,因此程序输出A,[3,2],将当前键替换为B,并将累积数组替换为[7].

Once again, the program gets the smallest key value from the queue. This time it's B. B is not equal to A, so the program outputs A,[3,2], replaces the current key with B, and replaces the accumulation array with [7].

这一直持续到没有更多项目可以合并为止.

This continues until there are no more items to be merged.

用于处理重新填充优先级队列的代码有点儿怪异,但并不是很困难.

The code to handle refilling the priority queue is a bit fiddly, but not really difficult.

另一种方法是使用操作系统的sort实用程序对文件进行排序和合并,然后编写一个简单的循环,线性循环遍历单个已排序的文件以累积值.

An alternative is to use your operating system's sort utility to sort and merge the files, and then write a simple loop that goes through the single sorted file linearly to accumulate the values.

这篇关于将键值对划分为相等的列表,而无需访问键值计数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆