SparkSQL DataFrame order by across partitions


Problem description

I'm using Spark SQL to run a query over my dataset. The result of the query is pretty small, but it is still partitioned.

I would like to coalesce the resulting DataFrame and order the rows by a column. I tried:

DataFrame result = sparkSQLContext.sql("my sql").coalesce(1).orderBy("col1");
result.toJSON().saveAsTextFile("output");

I also tried:

DataFrame result = sparkSQLContext.sql("my sql").repartition(1).orderBy("col1");
result.toJSON().saveAsTextFile("output");

The output file is ordered in chunks (i.e. the partitions are individually ordered, but the DataFrame is not ordered as a whole). For example, instead of

1, value
2, value
4, value
4, value
5, value
5, value
...

I get

2, value
4, value
5, value
-----------> partition boundary
1, value
4, value
5, value
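The chunked output above can be reproduced outside of Spark: sorting each partition independently and concatenating the results is not the same as a global sort. A minimal Python sketch (the partition contents are made up to match the example above):

```python
# Simulate two partitions of (key, value) rows.
partitions = [
    [(2, "value"), (5, "value"), (4, "value")],
    [(4, "value"), (1, "value"), (5, "value")],
]

# Per-partition sort: each chunk is ordered, but the concatenation is not.
per_partition = [row for part in partitions for row in sorted(part)]
print([k for k, _ in per_partition])  # [2, 4, 5, 1, 4, 5]

# Global sort: flatten first, then sort the whole dataset.
global_sort = sorted(row for part in partitions for row in part)
print([k for k, _ in global_sort])    # [1, 2, 4, 4, 5, 5]
```

The first list is exactly the "ordered in chunks" shape shown above; a global orderBy should produce the second.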

  1. What is the correct way to get an absolute (global) ordering of my query result?
  2. Why isn't the DataFrame being combined into a single partition?

Recommended answer

I want to mention a couple of things here.

1 - The source code shows that the orderBy statement internally calls the sorting API with global ordering set to true, so the lack of ordering at the output level suggests that the ordering was lost while writing to the target. My point is that a call to orderBy always requests a global order.

2 - Using a drastic coalesce, as in forcing a single partition in your case, can be really dangerous. I would recommend you do not do that. The source code suggests that calling coalesce(1) can potentially cause upstream transformations to run on a single partition, which would be brutal performance-wise.

3 - You seem to expect the orderBy statement to be executed within a single partition. I don't think I agree with that; that would make Spark a really silly distributed framework.

Community, please let me know if you agree or disagree with these statements.

How are you collecting data from the output, anyway?

Maybe the output actually contains sorted data, but the transformations/actions that you performed in order to read from the output are responsible for the lost order.
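One concrete way this can happen: saveAsTextFile writes a directory of part files, and a globally sorted result is only recovered if those files are read back in partition order. A Python sketch simulating this outside Spark (the part-file names and contents are hypothetical, chosen to match the example data):

```python
import os
import tempfile

# Globally sorted rows split across two "part files", the way a
# distributed writer would split them (hypothetical file names).
parts = {"part-00000": [1, 2, 4], "part-00001": [4, 5, 5]}

outdir = tempfile.mkdtemp()
for name, keys in parts.items():
    with open(os.path.join(outdir, name), "w") as f:
        f.write("\n".join(f"{k}, value" for k in keys))

# Reading the files back in sorted name order preserves the global order...
ordered = []
for name in sorted(os.listdir(outdir)):
    with open(os.path.join(outdir, name)) as f:
        ordered += [int(line.split(",")[0]) for line in f]
print(ordered)   # [1, 2, 4, 4, 5, 5]

# ...but reading them in reverse (or arbitrary) order does not.
shuffled = []
for name in sorted(os.listdir(outdir), reverse=True):
    with open(os.path.join(outdir, name)) as f:
        shuffled += [int(line.split(",")[0]) for line in f]
print(shuffled)  # [4, 5, 5, 1, 2, 4]
```

So a reader that does not respect part-file order can turn a correctly sorted output back into the chunked shape shown in the question.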
