DataFrame groupBy behaviour/optimization


Question


Suppose we have a DataFrame df consisting of the following columns:

Name, Surname, Size, Width, Length, Weight


Now we want to perform a couple of operations, for example we want to create a couple of DataFrames containing data about Size and Width.

val df1 = df.groupBy("surname").agg( sum("size") )
val df2 = df.groupBy("surname").agg( sum("width") )

As you can notice, other columns, like Length, are not used anywhere. Is Spark smart enough to drop the redundant columns before the shuffling phase, or are they carried around? Will running:

val dfBasic = df.select("surname", "size", "width")


before grouping somehow affect the performance?

Answer

Yes, it is "smart enough". groupBy performed on a DataFrame is not the same operation as groupBy performed on a plain RDD. In the scenario you've described, there is no need to move the raw data at all. Let's create a small example to illustrate that:

val df = sc.parallelize(Seq(
   ("a", "foo", 1), ("a", "foo", 3), ("b", "bar", 5), ("b", "bar", 1)
)).toDF("x", "y", "z")

df.groupBy("x").agg(sum($"z")).explain

//  == Physical Plan ==
//  TungstenAggregate(key=[x#3], functions=[(sum(cast(z#5 as bigint)),mode=Final,isDistinct=false)], output=[x#3,sum(z)#11L])
//   TungstenExchange hashpartitioning(x#3)
//    TungstenAggregate(key=[x#3], functions=[(sum(cast(z#5 as bigint)),mode=Partial,isDistinct=false)], output=[x#3,currentSum#20L])
//     TungstenProject [_1#0 AS x#3,_3#2 AS z#5]
//      Scan PhysicalRDD[_1#0,_2#1,_3#2]

As you can see, the first phase is a projection where only the required columns are preserved. Next, the data is aggregated locally, and finally transferred and aggregated globally. You'll get slightly different output if you use Spark <= 1.4, but the general structure should be exactly the same.
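The two-phase structure above can be mimicked with plain Scala collections (no Spark required) to make the point concrete: after the projection, only (key, partial sum) pairs would have to cross partition boundaries, never the raw rows. The partition layout and names below are illustrative assumptions, not Spark internals.

```scala
// A minimal sketch of projection + partial/final aggregation, mirroring the
// TungstenProject / TungstenAggregate(Partial) / TungstenAggregate(Final)
// steps in the physical plan. Data matches the example DataFrame; the
// split into two "partitions" is an assumption for illustration.
object TwoPhaseAgg {
  // Rows: (x, y, z), pre-split into hypothetical partitions
  val partitions: Seq[Seq[(String, String, Int)]] = Seq(
    Seq(("a", "foo", 1), ("a", "foo", 3)), // partition 0
    Seq(("b", "bar", 5), ("b", "bar", 1))  // partition 1
  )

  def run(): Map[String, Int] = {
    // Projection: keep only the columns the query needs (x and z);
    // column y is dropped before any data movement
    val projected: Seq[Seq[(String, Int)]] =
      partitions.map(_.map { case (x, _, z) => (x, z) })

    // Partial aggregation: sum z per key within each partition, so only
    // small (key, partialSum) pairs would need to be shuffled
    val partial: Seq[Map[String, Int]] =
      projected.map(_.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2).sum })

    // Final aggregation: merge the per-partition partial sums by key
    partial.flatten.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }
  }
}
```

Running `TwoPhaseAgg.run()` on this data yields `Map(a -> 4, b -> 6)`, matching `df.groupBy("x").agg(sum($"z"))`.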

Finally, a DAG visualization showing that the above description matches the actual job:

(image: DAG visualization of the job)

