Ranking PCollection elements


Question

I am using the Google Dataflow Java SDK 2.2.0. The use case is as follows:

PCollection pEmployees: employees and their corresponding department names. It may contain up to 10 million elements.

PCollection pDepartments: department names and the number of elements to be published per department. It will contain a few hundred elements.

Task: collect elements from pEmployees according to the per-department count given in pDepartments, for all departments. The result will be a big collection (up to a few hundred thousand elements, or a few GB).

We cannot use the Top transform here, as it would work on pEmployees for one department at a time, whereas we have multiple departments, and those are themselves in a PCollection. We could assign a row number to each element of pEmployees, join it with pDepartments, and filter out the records whose row_number exceeds the target number from pDepartments. This would require a global ranking.

Question: how can we assign rank/row numbers to the elements in a PCollection?

Recommended answer

This is very close to the Sample transform, but not quite, because Sample applies the same threshold to all keys when used as .perKey(). In general, Beam currently doesn't support per-key combines with different combine function parameters for each key.

I'd recommend emulating it by using CoGroupByKey to join pEmployees and pDepartments and obtain tuples (CoGbkResult) containing the department name, N = the number of elements, and all employees in that department. Then simply iterate through the employees, emit the first N, and discard the rest.
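
The "emit the first N and discard the rest" step can be sketched in plain Java, with in-memory maps standing in for the grouped result of the join. This is only an illustration of the per-department logic, not Beam pipeline code: the class name FirstNPerDepartment, the method takeFirstN, and the sample data are all hypothetical. In an actual pipeline the same loop would presumably live inside a DoFn over the CoGroupByKey output, reading N and the employees from the CoGbkResult via its getOnly and getAll accessors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the per-key step that follows the join.
class FirstNPerDepartment {

    /**
     * For each department, keeps at most limits.get(department) employees,
     * in iteration order. Departments without a limit entry keep none.
     */
    static Map<String, List<String>> takeFirstN(
            Map<String, List<String>> employeesByDept,
            Map<String, Integer> limits) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : employeesByDept.entrySet()) {
            // N for this department; 0 when the department has no entry.
            int n = limits.getOrDefault(entry.getKey(), 0);
            List<String> kept = new ArrayList<>();
            for (String employee : entry.getValue()) {
                if (kept.size() >= n) {
                    break; // discard the rest without reading further
                }
                kept.add(employee);
            }
            result.put(entry.getKey(), kept);
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample data, invented for illustration.
        Map<String, List<String>> employees = new LinkedHashMap<>();
        employees.put("eng", Arrays.asList("alice", "bob", "carol"));
        employees.put("sales", Arrays.asList("dave"));

        Map<String, Integer> limits = new HashMap<>();
        limits.put("eng", 2);
        limits.put("sales", 5);

        // prints {eng=[alice, bob], sales=[dave]}
        System.out.println(takeFirstN(employees, limits));
    }
}
```

Breaking out of the loop once N employees have been kept means the remaining employees of a large department are never touched. Note that the order of grouped values is not something to rely on after a group-by, so "first N" here means an arbitrary N unless the employees are sorted explicitly.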
