Ranking pcollection elements

Views: 20 Published: 2021/11/11 22:40:27 Tags: google-cloud-dataflow apache-beam

Problem description

I am using the Google Dataflow Java SDK 2.2.0. The use case is as follows:

PCollection pEmployees: employees and their corresponding department names. It may contain up to 10 million elements.

PCollection pDepartments: department names and the number of elements to be published per department. It will contain a few hundred elements.

Task: collect elements from pEmployees according to the per-department number, for all departments in pDepartments. The result will be a big collection (up to a few hundred thousand elements, or a few GB).

We cannot use the Top transform here, as it would work on pEmployees one at a time, whereas we have multiple departments, and those are themselves in a PCollection. We could assign a row number to each element of pEmployees, join it with pDepartments, and filter out the records where row_number > target number from pDepartments. This would require a global ranking.
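To make the intended result concrete, here is a minimal plain-Java sketch of the "assign a row number, join with the target, filter" idea. It is not a Beam pipeline: it runs in memory on a List, and it numbers rows within each department rather than globally, which is enough for this filter. The names `Employee`, `takePerDepartment`, and the sample data are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DepartmentRowNumbers {

    /** Hypothetical element type: an employee and the department it belongs to. */
    static final class Employee {
        final String name;
        final String department;

        Employee(String name, String department) {
            this.name = name;
            this.department = department;
        }
    }

    /**
     * Keeps at most targets.get(department) employees per department.
     * Row numbers start at 1 and are assigned in input order within each
     * department (an assumption; the question does not fix an ordering).
     */
    static List<Employee> takePerDepartment(List<Employee> employees,
                                            Map<String, Integer> targets) {
        Map<String, Integer> lastRowNumber = new HashMap<>();
        List<Employee> kept = new ArrayList<>();
        for (Employee e : employees) {
            // merge returns the updated counter: 1 for the first row seen, then 2, 3, ...
            int rowNumber = lastRowNumber.merge(e.department, 1, Integer::sum);
            // "Join" with the per-department target; unknown departments keep nothing.
            int target = targets.getOrDefault(e.department, 0);
            if (rowNumber <= target) {
                kept.add(e);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Employee> employees = List.of(
                new Employee("ann", "eng"),
                new Employee("bob", "eng"),
                new Employee("cat", "eng"),
                new Employee("dan", "ops"),
                new Employee("eve", "ops"));
        Map<String, Integer> targets = Map.of("eng", 2, "ops", 1);

        for (Employee e : takePerDepartment(employees, targets)) {
            System.out.println(e.department + " " + e.name);
        }
    }
}
```

In a distributed setting the same per-department logic would have to run after grouping by department, since a single sequential counter like the one above does not exist across workers.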

Question: how can we assign rank/row numbers to the elements in a PCollection?
