Ranking PCollection elements

Problem description

I am using the Google Dataflow Java SDK 2.2.0. The use case is as follows:

PCollection pEmployees: employees and their corresponding department names. May contain up to 10 million elements.

PCollection pDepartments: department names and the number of elements to be published per department. Will contain a few hundred elements.

Task: for every department in pDepartments, collect from pEmployees the number of elements specified for that department. The result will be a big collection (up to a few hundred thousand elements, or a few GB).

We cannot use the Top transform here, because it would work on pEmployees for one department at a time, whereas we have multiple departments, and those are themselves in a PCollection. We could assign a row number to each element of pEmployees, join it with pDepartments, and filter out the records where row_number > the target number from pDepartments. This would require a global ranking.

Question: how can we assign rank/row numbers to the elements in a PCollection?

Recommended answer

This is very close to the Sample transform, but not quite, because Sample applies the same threshold to all keys when used per key. More generally, Beam currently doesn't support per-key combines with different combine function parameters for each key.

I'd recommend emulating it by using CoGroupByKey to join pEmployees and pDepartments, obtaining tuples (CoGbkResult) that contain the department name, N = the number of elements, and all employees in that department. Then simply iterate through the employees, emit the first N, and discard the rest.
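
The per-department step described above can be sketched in plain Java, without any Beam dependency, to show the logic only. The class name FirstNPerDepartment, the Employee type, and the select method are hypothetical names invented for this illustration. In a real pipeline the grouping would be done by CoGroupByKey, and the loop would probably live in a DoFn that reads N and the employees from each CoGbkResult.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of "join by department, then keep the first N per department".
// All names here are illustrative assumptions, not part of the Beam API.
final class FirstNPerDepartment {

    // Hypothetical stand-in for one element of pEmployees.
    static final class Employee {
        final String department;
        final String name;

        Employee(String department, String name) {
            this.department = department;
            this.name = name;
        }
    }

    // limits models pDepartments: department name -> number of elements to emit.
    static Map<String, List<String>> select(List<Employee> employees,
                                            Map<String, Integer> limits) {
        // Grouping step: in a pipeline this is what CoGroupByKey provides.
        Map<String, List<String>> byDepartment = new LinkedHashMap<>();
        for (Employee e : employees) {
            byDepartment.computeIfAbsent(e.department, k -> new ArrayList<>()).add(e.name);
        }

        // Per-key step: emit the first N employees and discard the rest.
        // Departments that do not appear in limits are dropped entirely.
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> limit : limits.entrySet()) {
            List<String> members =
                byDepartment.getOrDefault(limit.getKey(), new ArrayList<>());
            int n = Math.min(Math.max(limit.getValue(), 0), members.size());
            result.put(limit.getKey(), new ArrayList<>(members.subList(0, n)));
        }
        return result;
    }
}
```

Note that a grouped result in Beam does not guarantee any order among the values of a key, so "the first N" there means an arbitrary N employees per department. If a specific ranking is needed, the employees of each department would have to be sorted inside the per-key step before truncating.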
