How to partition data dynamically in this use-case

Problem description

I am using spark-sql 2.4.1. I have code something like the following, for the scenario described below.

val superDataset = ... // load the whole data set of student marks records; assume 10 years of data
// select the given years (e.g. 2010 and 2011) and repartition by year
val selectedYrsDataset = superDataset.where(col("year").isin(2010, 2011)).repartition(col("year"))

On the selectedYrsDataset I need to calculate the year-wise toppers, overall as well as country-wise, state-wise and college-wise.

How to do this kind of use-case? Is it possible to do it with dynamic partitioning, i.e. at each new logic step we add another partition column and repartition the already-partitioned dataset accordingly, in such a way as to avoid major shuffling?

Answer

Sample dataframe:

+----+-------+-----+-------+-----+
|year|country|state|college|marks|
+----+-------+-----+-------+-----+
|2019|  India|    A|     AC|   15|
|2019|  India|    A|     AC|   25|
|2019|  India|    A|     AC|   35|
|2019|  India|    A|     AD|   40|
|2019|  India|    B|     AC|   15|
|2019|  India|    B|     AC|   50|
|2019|  India|    B|     BC|   65|
|2019|    USA|    A|     UC|   15|
|2019|    USA|    A|     UC|   65|
|2019|    USA|    A|     UD|   45|
|2019|    USA|    B|     UC|   44|
|2019|    USA|    B|     MC|   88|
|2019|    USA|    B|     MC|   90|
|2020|  India|    A|     AC|   65|
|2020|  India|    A|     AC|   33|
|2020|  India|    A|     AC|   55|
|2020|  India|    A|     AD|   70|
|2020|  India|    B|     AC|   88|
|2020|  India|    B|     AC|   60|
|2020|  India|    B|     BC|   45|
|2020|    USA|    A|     UC|   85|
|2020|    USA|    A|     UC|   55|
|2020|    USA|    A|     UD|   32|
|2020|    USA|    B|     UC|   64|
|2020|    USA|    B|     MC|   78|
|2020|    USA|    B|     MC|   80|
+----+-------+-----+-------+-----+
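
To try the answer locally, a minimal sketch of how a dataframe like this could be built (assuming a local SparkSession named spark; only a handful of the rows above are included for brevity) is:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("toppers-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Build a small sample of the dataframe shown above
val df = Seq(
  (2019, "India", "A", "AC", 15),
  (2019, "India", "A", "AD", 40),
  (2019, "India", "B", "BC", 65),
  (2019, "USA",   "B", "MC", 90),
  (2020, "India", "B", "AC", 88),
  (2020, "USA",   "A", "UC", 85)
).toDF("year", "country", "state", "college", "marks")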

In order to do multi-dimensional aggregation, you can do it in two ways: by using grouping sets or by using rollup in Spark. To read more about these multi-dimensional aggregations, follow this link: Multi-Dimensional Aggregation.
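
For completeness, a rough sketch of the grouping-sets variant (the temporary view name marks_view is only an illustration, not part of the original answer) could look like:

// Register the dataframe as a temporary view so it can be queried with Spark SQL
df.createOrReplaceTempView("marks_view")

val gsDf = spark.sql("""
  SELECT year, country, state, college, max(marks) AS Marks
  FROM marks_view
  GROUP BY year, country, state, college
  GROUPING SETS ((year, country, state, college), (year, country, state), (year, country), (year), ())
""")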

The solution using rollup is provided as follows:

import org.apache.spark.sql.functions.max

val ans_df = df.rollup("year", "country", "state", "college").agg(max("marks").as("Marks"))

Result:

+----+-------+-----+-------+-----+
|year|country|state|college|Marks|
+----+-------+-----+-------+-----+
|2020|  India|    A|     AC|   65|
|2019|  India|    B|     BC|   65|
|2020|  India|    B|   null|   88|
|2019|    USA|    B|     UC|   44|
|2020|  India|    B|     AC|   88|
|2020|    USA| null|   null|   85|
|2019|  India|    A|     AC|   35|
|2019|    USA|    B|     MC|   90|
|2019|  India|    A|     AD|   40|
|2019|    USA|    A|     UD|   45|
|2019|    USA| null|   null|   90|
|2020|    USA|    A|     UD|   32|
|null|   null| null|   null|   90|
|2019|    USA|    B|   null|   90|
|2020|  India| null|   null|   88|
|2019|    USA|    A|   null|   65|
|2019|  India|    B|   null|   65|
|2019|    USA|    A|     UC|   65|
|2020|  India|    B|     BC|   45|
|2020|    USA|    B|     UC|   64|
+----+-------+-----+-------+-----+
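
In the rollup output, null marks a rolled-up level: for example, rows where college is null but state is not hold the state-level maxima. If you only need one level, one possible way to slice the result (this filter is an illustration, not part of the original answer) is:

import org.apache.spark.sql.functions.col

// Keep only the state-level maxima per year and country (college rolled up)
val stateLevel = ans_df.filter(col("college").isNull && col("state").isNotNull)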

Moreover, as asked, Spark makes sure this operation is done in an optimal manner and makes use of the already partitioned data when doing a groupBy on an additional column. For example, when doing a groupBy on the key (year, country, state, college), the data already grouped on the key (year, country, state) will be used, thereby reducing significant computation.
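
If you want to check how Spark actually plans this, one option (a simple verification step, not from the original answer) is to print the extended query plan and look at the partial aggregation and exchange steps:

// Prints the parsed, analyzed, optimized and physical plans for the rollup query
ans_df.explain(true)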
