Partition column is moved to end of row when saving a file to Parquet


Problem Description

For a given DataFrame, just before it is saved to parquet, here is the schema: notice that centroid0 is the first column and is StringType:
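An illustrative printSchema output consistent with this description (the column names other than centroid0 are hypothetical stand-ins) would be:

    root
     |-- centroid0: string (nullable = true)
     |-- feature1: double (nullable = true)
     |-- feature2: double (nullable = true)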

However, when saving the file using:

      df.write.partitionBy(dfHolder.metadata.partitionCols: _*).format("parquet").mode("overwrite").save(fpath)

and with partitionCols being centroid0, there is a (to me) surprising result:

  • the centroid0 partition column has been moved to the end of the Row
  • the data type has been changed to Integer

I confirmed the output path via println:

 path=/git/block/target/scala-2.11/test-classes/data/output/blocking/out//level1/clusters

And here is the schema upon reading back from the saved parquet:
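An illustrative read-back schema matching the described result (same hypothetical companion columns as above) would be:

    root
     |-- feature1: double (nullable = true)
     |-- feature2: double (nullable = true)
     |-- centroid0: integer (nullable = true)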

Why are those two modifications to the input schema occurring - and how can they be avoided - while still maintaining centroid0 as a partitioning column?

Update: A preferred answer should mention why/when the partitions are added to the end (vs. the beginning) of the columns list. We need an understanding of the deterministic ordering.

In addition - is there any way to cause Spark to "change its mind" on the inferred column types? I have had to change the partition values from 0, 1, etc. to c0, c1, etc. in order to get the inference to map to StringType. Maybe that is required, but if there were some Spark setting to change the behavior, that would make for an excellent answer.

Recommended Answer

When you write.partitionBy(...), Spark saves the partition field(s) as folder(s). This can be beneficial for reading the data later, because (with some file types, parquet included) it can optimize to read data only from the partitions that you use (i.e. if you read and filter on centroid0 == 1, Spark won't read the other partitions).
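As a minimal sketch of that pruning, reusing the question's spark session and fpath (everything else here is illustrative):

    import org.apache.spark.sql.functions.col

    // fpath is the same output path used in the question's save call.
    val clusters = spark.read.parquet(fpath)
    // Filtering on the partition column lets Spark scan only the files
    // under centroid0=1 (partition pruning) instead of the whole dataset.
    clusters.filter(col("centroid0") === 1).show()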

The effect of this is that the partition fields (centroid0 in your case) are not written into the parquet files themselves, only as folder names (centroid0=1, centroid0=2, etc.).
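On disk, the resulting layout looks roughly like this (the part-file names are illustrative):

    .../level1/clusters/centroid0=1/part-00000-<uuid>.snappy.parquet
    .../level1/clusters/centroid0=2/part-00000-<uuid>.snappy.parquet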

One side effect of this is that the type of the partition column is inferred at run time (since its schema is not saved in the parquet files), and in your case you happened to have only integer-like values, so it was inferred as integer.
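If you want to prevent that inference entirely, Spark has a setting that disables partition-column type inference, so partition values come back as strings (a minimal sketch, again reusing the question's fpath):

    // Disable type inference for partition columns; partition values are
    // then read back as StringType instead of being guessed as integers.
    spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

    val df = spark.read.parquet(fpath)
    df.printSchema()  // centroid0: string (nullable = true) -- still at the end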

The other side effect is that the partition field is appended at the end of the schema: Spark reads the schema from the parquet files as one chunk, and then adds the partition field(s) to it separately (again, they are no longer part of the schema stored in the parquet).
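If both the original type and the original column order matter, one workaround (a sketch, with feature1 standing in for the question's other, unnamed columns) is to supply an explicit schema on read and then select the columns back into the desired order:

    import org.apache.spark.sql.types._

    // feature1 is a hypothetical stand-in for the dataset's other columns.
    val schema = StructType(Seq(
      StructField("centroid0", StringType),
      StructField("feature1", DoubleType)
    ))

    val restored = spark.read
      .schema(schema)                    // pins centroid0 to StringType
      .parquet(fpath)
      .select("centroid0", "feature1")   // restores the original column order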

