Saving ordered dataframe in Spark


Question

I'm trying to save an ordered DataFrame to HDFS. My code looks like this:

dataFrame.orderBy("index").write().mode(SaveMode.Overwrite).parquet(getPath());

I run the same code on two different clusters; one cluster uses Spark 1.5.0, the other 1.6.0. When running on the cluster with Spark 1.5.0, the ordering is not preserved after saving to disk.

Is there any specific cluster setting to preserve sorting while saving data to disk? Or is it a known problem of that Spark version? I've searched the Spark documentation but couldn't find any information about it.

Update:

I've checked the Parquet files, and in both cases the files themselves are sorted. So the problem occurs while reading: Spark 1.5.0 doesn't preserve the ordering while reading, and 1.6.0 does.

So my question now is:

Answer

There are several things going on here:


  1. When you write, Spark splits the data into several partitions, and those are written separately, so even if the data is ordered, it is split.

  2. When you read, the partitions do not preserve ordering between them, so the data would only be sorted within blocks. Worse, there might be something other than a 1:1 mapping of files to partitions:


  • Several files might be mapped to a single partition in the wrong order, so the sorting within the partition would only hold within each block.

  • A single file might be split between partitions (if it is larger than the block size).
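The read-side effect described above can be simulated without Spark. This is a minimal plain-Python sketch (the part-file names and contents are hypothetical, chosen only to mirror Spark's one-file-per-partition layout): each "part file" is internally sorted, but a reader that assembles them in an arbitrary order loses the global ordering.

```python
# Pretend each "part file" holds a sorted slice of a globally sorted dataset,
# the way Spark writes one file per partition.
part_files = {
    "part-00000": [1, 2, 3],
    "part-00001": [4, 5, 6],
    "part-00002": [7, 8, 9],
}

# A reader that maps files to partitions in an arbitrary order loses the
# global ordering, even though every individual block stays sorted.
read_order = ["part-00002", "part-00000", "part-00001"]
data = [x for name in read_order for x in part_files[name]]

print(data)                                               # [7, 8, 9, 1, 2, 3, 4, 5, 6]
print(all(b == sorted(b) for b in part_files.values()))   # True: each block is sorted
print(data == sorted(data))                               # False: global order is lost
```

This is exactly the situation described in the update: the files on disk are sorted in both Spark versions, but only a reader that respects the partition order reconstructs a globally sorted DataFrame.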

Based on the above, the easiest solution would be to repartition (or rather coalesce) to 1 when writing, and thus have a single file. When that file is read, the data will be ordered if the file is smaller than the block size (you can even make the block size very large to ensure this).

The problem with this solution is that it reduces your parallelism: when you write, you need to repartition, and when you read, you need to repartition again to regain parallelism. The coalesce/repartition can be costly. The second problem with this solution is that it doesn't scale well (you might end up with one huge file).

A better solution would be based on your use case. The basic question is whether you can use partitioning before sorting. For example, if you are planning to do a custom aggregation that requires the sorting, then making sure to keep a 1:1 mapping between files and partitions ensures the sorting within each partition, which might be enough for you. You can also add the maximum value inside each partition as a second value, then group by it and do a secondary sort.
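The max-value-per-partition idea at the end can be sketched in plain Python. The sketch assumes what an `orderBy`-then-write produces: range-partitioned part files whose key ranges are disjoint and each internally sorted; under that assumption, sorting the partitions by their maximum value and concatenating restores the global order without a full shuffle-sort of every record.

```python
# Partitions read back in arbitrary order; each is internally sorted and
# covers a disjoint key range (as produced by orderBy followed by write).
partitions = [[7, 8, 9], [1, 2, 3], [4, 5, 6]]

# Tag each partition with its maximum value, sort the partitions by that tag,
# then concatenate: a cheap way to restore the global order.
tagged = [(max(p), p) for p in partitions]
restored = [x for _, p in sorted(tagged) for x in p]

print(restored)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Note that this only works because the ranges are disjoint; if partitions overlapped, a real secondary sort within groups would still be needed.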

