Efficient grouping by key "within" partitions


Question

I'm trying to adapt a process to Spark. Basically, the process analyzes batches of data from a JDBC data source. Each record has a batchId and also a higher-level groupId.


  • The number of batches is large (unknown in advance).

  • The number of groups is ~100.

  • The number of records for each batch can fit in RAM.

The actual analyzing code doesn't matter, but it doesn't fit the more specific models of reduceByKey or combineByKey.

My idea was:


  • use JdbcRDD to read the data, using the "group id" for partitioning

  • use group by batchId to prepare the data

  • use map to apply the business logic.

The bottleneck appears to be the groupByKey, which from my understanding will force a shuffle (writing data to disk) - even though each batch is contained in a single partition.

The other possible approach is to use batchId for partitioning, but this will create a very large number of partitions - and therefore a large number of queries.

Is there a way to perform a group by key within a partition? Is there any other possible approach?

Recommended answer

Yes, you need to use mapPartitions. It gives you an Iterator over all records in the partition. From there you are just writing Scala code and can do what you like, including building up a Map of batch ID to records. Mind that this map has to fit in memory, but you can always reduce the partition size if that matters.
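
As a rough illustration of that idea, here is a minimal sketch of the per-partition grouping function. The `Record` case class, the `groupByBatch` and `analyze` names, and the sample data are all hypothetical, made up for this example. The sketch assumes, as the question states, that every batch lives entirely inside one partition. It runs on a plain Scala Iterator so it can be tried without a cluster.

```scala
import scala.collection.mutable

// Hypothetical record shape; field names mirror the question's batchId/groupId.
final case class Record(groupId: Int, batchId: Long, payload: String)

object GroupWithinPartition {

  // Groups the records of ONE partition by batchId, entirely in memory.
  // Only this partition's iterator is consumed, so no shuffle is needed.
  // The whole partition is buffered, so it has to fit in RAM.
  def groupByBatch(records: Iterator[Record]): Iterator[(Long, Vector[Record])] = {
    // LinkedHashMap keeps batches in first-seen order.
    val batches = mutable.LinkedHashMap.empty[Long, mutable.ArrayBuffer[Record]]
    records.foreach { r =>
      batches.getOrElseUpdate(r.batchId, mutable.ArrayBuffer.empty[Record]) += r
    }
    batches.iterator.map { case (batchId, buf) => (batchId, buf.toVector) }
  }

  // Placeholder for the real business logic (assumption: it works per batch).
  def analyze(batchId: Long, batch: Vector[Record]): String =
    s"batch $batchId: ${batch.size} records"

  def main(args: Array[String]): Unit = {
    // Stand-in for the iterator Spark would hand to mapPartitions.
    val partition = Iterator(
      Record(1, 10L, "a"),
      Record(1, 11L, "b"),
      Record(1, 10L, "c")
    )
    groupByBatch(partition)
      .map { case (id, batch) => analyze(id, batch) }
      .foreach(println)
  }
}
```

With Spark, assuming an `RDD[Record]` named `records` that is already partitioned by group id, the same function could be plugged in as `records.mapPartitions(it => GroupWithinPartition.groupByBatch(it).map { case (id, b) => GroupWithinPartition.analyze(id, b) })`. If a partition turns out to be too large to buffer, increasing the number of partitions is the knob the answer points to.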
