Ranking PCollection elements
Problem description
I am using the Google Dataflow Java SDK 2.2.0. The use case is as follows:
PCollection pEmployees: employees and their corresponding department names. May contain up to 10 million elements.
PCollection pDepartments: department names and the number of elements to be published per department. Will contain a few hundred elements.
Task: collect elements from pEmployees according to the per-department count given in pDepartments, for all departments. The result will be a big collection (up to a few hundred thousand elements, or a few GB).
We cannot use the Top transform here, as it would work on pEmployees one department at a time, whereas we have multiple departments, and those are themselves in a PCollection. We could assign a row number to each element of pEmployees, join it with pDepartments, and filter out the records where row_number > the target number from pDepartments. This would require a global ranking.
Question: how can we assign rank/row numbers to the elements in a PCollection?
Recommended answer
This is very close to the Sample transform, but not quite, because Sample applies the same threshold to all keys when used as .perKey(). In general, Beam currently doesn't support per-key combines with different combine function parameters for each key.
I'd recommend emulating it by using CoGroupByKey to join pEmployees and pDepartments, obtaining tuples (CoGbkResult) that contain the department name, N = the number of elements, and all employees in that department. Then simply iterate through the employees, emit the first N, and discard the rest.
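The "emit the first N and discard the rest" step can be illustrated with a minimal sketch. This is plain Java over in-memory maps, not Beam API code: the class FirstNPerDepartment, the method takeFirstN, and the two maps are hypothetical stand-ins for the two sides of the joined result keyed by department name. In a real pipeline the grouping would come from CoGroupByKey and this loop would live inside a DoFn. Note also that the order of elements within a grouped key is generally not guaranteed in Beam, so "first N" here means an arbitrary N unless the employees are sorted first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in (NOT Beam API) for the per-key step that would run
// after the join. All names here are hypothetical and for illustration only.
class FirstNPerDepartment {

    // employeesByDept: department name -> all employees in that department.
    // limitByDept: department name -> N, the number of employees to keep.
    static Map<String, List<String>> takeFirstN(
            Map<String, List<String>> employeesByDept,
            Map<String, Integer> limitByDept) {
        Map<String, List<String>> selected = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : limitByDept.entrySet()) {
            String dept = entry.getKey();
            int n = entry.getValue();
            // A department with a limit but no employees yields an empty list.
            List<String> employees =
                    employeesByDept.getOrDefault(dept, Collections.<String>emptyList());
            List<String> firstN = new ArrayList<>();
            for (String employee : employees) {
                if (firstN.size() >= n) {
                    break; // discard the rest
                }
                firstN.add(employee); // emit one of the first N
            }
            selected.put(dept, firstN);
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, List<String>> employees = new LinkedHashMap<>();
        employees.put("eng", Arrays.asList("ann", "bob", "cyd"));
        employees.put("ops", Arrays.asList("dee"));

        Map<String, Integer> limits = new LinkedHashMap<>();
        limits.put("eng", 2);
        limits.put("ops", 5);

        // Prints {eng=[ann, bob], ops=[dee]}
        System.out.println(takeFirstN(employees, limits));
    }
}
```

Because each department is handled independently, every key can have its own N, which is the part that Sample with .perKey() does not offer.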