Spark: groupBy taking a lot of time


Question

In my application, when taking performance numbers, groupBy is eating up a lot of time.

My RDD has the structure below:

JavaPairRDD<CustomTuple, Map<String, Double>>

CustomTuple: this object contains information about the current row in the RDD, such as which week, month, city, etc.

public class CustomTuple implements Serializable {

    private Map hierarchyMap = null;
    private Map granularMap  = null;
    private String timePeriod = null;
    private String sourceKey  = null;
}

Map

This map contains the statistical data about that row, such as how much investment, how many GRPs, etc.

<"Inv", 20>

<"GRP", 30>

I was executing the DAG below on this RDD:


  1. Apply a filter on this RDD to scope out the relevant rows: filter

  2. Apply a filter on this RDD to scope out the relevant rows: filter

  3. Join the RDDs: join

  4. Apply a map phase to compute investment: map

  5. Apply a groupBy phase to group the data according to the desired view: groupBy

  6. Apply a map phase to aggregate the data as per the grouping achieved in the step above (say, view data across timePeriod), and also create new objects as per the result set that needs to be collected: map

  7. Collect the result: collect

So if the user wants to view investment across time periods, then the list below is returned (this is achieved in step 4 above):

<timeperiod1, value> 

When I checked the time taken by each operation, groupBy was taking 90% of the time taken to execute the whole DAG.

IMO, we can replace the groupBy and the subsequent map operation with a single reduce. But reduce will work on objects of type JavaPairRDD<CustomTuple, Map<String, Double>>. So my reduce would be like T reduce(T, T, T), where T would be (CustomTuple, Map).

Or maybe, after step 3 in the above DAG, I run another map function that returns an RDD of (key, value) pairs for the metric that needs to be aggregated, and then run a reduce.

Also, I am not sure how the aggregate function works, and whether it would be able to help me in this case.

Secondly, my application will receive requests on varying keys. With my current RDD design, each request requires me to repartition or re-group my RDD on that key. This means that for each request, grouping/repartitioning takes 95% of the time to compute the job.

<"market1", 20>
<"market2", 30>

This is very discouraging, as the current performance of the application without Spark is 10 times better than the performance with Spark.

Any insight is appreciated.

We also noticed that the join was taking a lot of time. Maybe that's why groupBy was taking time.

TIA!

Answer

Spark's documentation encourages you to avoid groupBy operations; instead it suggests combineByKey or one of its derived operations (reduceByKey or aggregateByKey). You have to use these operations in order to aggregate both before and after the shuffle (in the map phase and in the reduce phase, if we use Hadoop terminology), so your execution time will improve (I don't know if it will be 10 times better, but it has to be better).
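A plain-Java sketch (no Spark; all names here are hypothetical) of why this helps: a reduceByKey/combineByKey-style job combines values per key inside each partition before the shuffle, so only one partial result per key per partition crosses the network, whereas groupByKey ships every raw record:

```java
import java.util.*;

public class PreAggregation {
    // Map-side combine: fold all records of one partition into one partial
    // sum per key, so only (key, partialSum) pairs are "shuffled".
    static Map<String, Double> combinePartition(List<Map.Entry<String, Double>> partition) {
        Map<String, Double> partial = new HashMap<>();
        for (Map.Entry<String, Double> rec : partition) {
            partial.merge(rec.getKey(), rec.getValue(), Double::sum);
        }
        return partial;
    }

    // Reduce-side merge: combine the per-partition partial sums into totals.
    static Map<String, Double> mergePartials(List<Map<String, Double>> partials) {
        Map<String, Double> total = new HashMap<>();
        for (Map<String, Double> p : partials) {
            p.forEach((k, v) -> total.merge(k, v, Double::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Double>> partition = new ArrayList<>();
        partition.add(new AbstractMap.SimpleEntry<>("market1", 20.0));
        partition.add(new AbstractMap.SimpleEntry<>("market1", 5.0));
        System.out.println(combinePartition(partition));
    }
}
```

Note that in the sketch the partition holding two "market1" records emits a single pair after the map-side combine; with a groupByKey both raw records would have been shuffled.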

If I understand your processing correctly, I think you can use a single combineByKey operation. The following explanation is written for Scala code, but you can translate it to Java code without too much effort.

combineByKey takes three arguments:

combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]


  • createCombiner: In this operation you create a new class in order to combine your data, so you could aggregate your CustomTuple data into a new class, CustomTupleCombiner (I don't know if you only want to make a sum, or maybe you want to apply some process to this data, but either option can be done in this operation).

  • mergeValue: In this operation you have to describe how a CustomTuple is summed into another CustomTupleCombiner (again, I am presupposing a simple summing operation). For example, if you want to sum the data by key, you will have a Map in your CustomTupleCombiner class, so the operation should be something like: CustomTupleCombiner.sum(CustomTuple), which makes CustomTupleCombiner.Map(CustomTuple.key) -> CustomTuple.Map(CustomTuple.key) + CustomTupleCombiner.value

  • mergeCombiners: In this operation you have to define how to merge two combiner classes, CustomTupleCombiner in my example. So this will be something like CustomTupleCombiner1.merge(CustomTupleCombiner2), which will be something like CustomTupleCombiner1.Map.keys.foreach(k -> CustomTupleCombiner1.Map(k) + CustomTupleCombiner2.Map(k)), or something like that.
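Putting the three functions together, here is a hedged plain-Java sketch of the CustomTupleCombiner described above, assuming simple per-key summing (no Spark types involved; the field and method names are illustrative, not from the original code):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the three combineByKey functions, with a metric map
// (Map<String, Double>) as the value type being combined.
public class CustomTupleCombiner {
    final Map<String, Double> sums = new HashMap<>();

    // createCombiner: build a combiner from the first value seen for a key.
    static CustomTupleCombiner create(Map<String, Double> first) {
        CustomTupleCombiner c = new CustomTupleCombiner();
        c.mergeValue(first);
        return c;
    }

    // mergeValue: fold one more value (one row's metric map) into this combiner.
    void mergeValue(Map<String, Double> value) {
        value.forEach((k, v) -> sums.merge(k, v, Double::sum));
    }

    // mergeCombiners: merge another combiner (from another partition) into this one.
    void merge(CustomTupleCombiner other) {
        other.sums.forEach((k, v) -> sums.merge(k, v, Double::sum));
    }

    public static void main(String[] args) {
        Map<String, Double> row = new HashMap<>();
        row.put("Inv", 20.0);
        System.out.println(create(row).sums);
    }
}
```

In the actual Spark call, `create`, `mergeValue`, and `merge` would be passed as the three function arguments of `combineByKey`.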

The pasted code is not tested (it will not even compile, because I wrote it with vim), but I think it might work for your scenario.

I hope this will be useful.

