Ranking PCollection elements
Problem description
I am using the Google Dataflow Java SDK 2.2.0. The use case is as follows:
PCollection pEmployees: employees and their corresponding department names. It may contain up to 10 million elements.
PCollection pDepartments: department names and the number of elements to be published per department. It will contain a few hundred elements.
Task: collect elements from pEmployees according to the per-department count given in pDepartments, for all departments. The result will be a big collection (up to a few hundred thousand elements, or a few GB).
We cannot use the Top transform here, as it would work on one department at a time on pEmployees, whereas we have multiple departments, and those are themselves in a PCollection. We could assign a row number to each element of pEmployees, join it with pDepartments, and filter out the records where row_number > the target number from pDepartments. This would require a global ranking.
Question: how can we assign rank/row numbers to the elements in a PCollection?
Recommended answer
This is very close to the Sample transform, but not quite, because Sample applies the same threshold to all keys when used as .perKey(). Generally, Beam currently doesn't support per-key combines with different combine function parameters.
I'd recommend emulating it by using CoGroupByKey to join pEmployees and pDepartments and obtain tuples (CoGbkResult) containing the department name, N = the number of elements, and all employees in that department. Then simply iterate through the employees, emit the first N, and discard the rest.
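To make the "emit the first N and discard the rest" step concrete, here is a minimal plain-Java sketch of the per-department logic, modeled on in-memory collections rather than on a real pipeline. The class name DepartmentLimiter, the methods firstN and selectPerDepartment, and the String[] {department, employee} record shape are all hypothetical and chosen only for illustration; this is not Beam code, and departments without a configured limit are assumed to emit nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: an in-memory model of the per-key step, not a Beam transform.
final class DepartmentLimiter {

  /** Returns at most {@code limit} leading elements of {@code values}. */
  static <T> List<T> firstN(Iterable<T> values, int limit) {
    List<T> kept = new ArrayList<>();
    for (T value : values) {
      if (kept.size() >= limit) {
        break; // limit reached: discard the rest
      }
      kept.add(value);
    }
    return kept;
  }

  /**
   * Groups {department, employee} pairs by department, then keeps the first N
   * employees of each department, where N comes from {@code limits}.
   * Assumption: a department with no entry in {@code limits} gets a limit of 0.
   */
  static Map<String, List<String>> selectPerDepartment(
      List<String[]> employees, Map<String, Integer> limits) {
    // Stand-in for the grouping that the join performs per department.
    Map<String, List<String>> byDepartment = new LinkedHashMap<>();
    for (String[] pair : employees) {
      byDepartment.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
    }
    Map<String, List<String>> selected = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> entry : byDepartment.entrySet()) {
      int limit = limits.getOrDefault(entry.getKey(), 0);
      selected.put(entry.getKey(), firstN(entry.getValue(), limit));
    }
    return selected;
  }

  public static void main(String[] args) {
    List<String[]> employees = Arrays.asList(
        new String[] {"eng", "alice"},
        new String[] {"eng", "bob"},
        new String[] {"eng", "carol"},
        new String[] {"ops", "dave"});
    Map<String, Integer> limits = new LinkedHashMap<>();
    limits.put("eng", 2);
    limits.put("ops", 5);
    System.out.println(selectPerDepartment(employees, limits));
  }
}
```

In an actual pipeline, the same loop would presumably live inside a DoFn over KV<String, CoGbkResult>, with the limit read via CoGbkResult.getOnly(tag, defaultValue) and the employees via CoGbkResult.getAll(tag). Note that grouped values carry no guaranteed order, so "first N" is an arbitrary N unless the employees are sorted by some ranking criterion before truncating.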