Partition column is moved to end of row when saving a file to Parquet


Problem Description

For a given DataFrame, just before it is saved to parquet, here is the schema: notice that centroid0 is the first column and is StringType:
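An illustrative printSchema output consistent with this description (the column names other than centroid0 are hypothetical stand-ins) would be:

    root
     |-- centroid0: string (nullable = true)
     |-- feature1: double (nullable = true)
     |-- feature2: double (nullable = true)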

However, when saving the file using:

      df.write.partitionBy(dfHolder.metadata.partitionCols: _*).format("parquet").mode("overwrite").save(fpath)

and with partitionCols being centroid0, there is a (to me) surprising result:

  • the centroid0 partition column has been moved to the end of the Row
  • the data type has been changed to Integer

I confirmed the output path via println:

 path=/git/block/target/scala-2.11/test-classes/data/output/blocking/out//level1/clusters

And here is the schema upon reading back from the saved parquet:
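An illustrative read-back schema matching the described result (same hypothetical companion columns as above) would be:

    root
     |-- feature1: double (nullable = true)
     |-- feature2: double (nullable = true)
     |-- centroid0: integer (nullable = true)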

Why are those two modifications to the input schema occurring - and how can they be avoided - while still maintaining centroid0 as a partitioning column?

Update: A preferred answer should mention why/when the partitions are added to the end (vs. the beginning) of the columns list. We need an understanding of the deterministic ordering.

In addition - is there any way to cause Spark to "change its mind" on the inferred column types? I have had to change the partition values from 0, 1, etc. to c0, c1, etc. in order to get the inference to map to StringType. Maybe that is required, but if there were some Spark setting to change the behavior, that would make for an excellent answer.

Recommended Answer

When you write.partitionBy(...), Spark saves the partition field(s) as folder(s). This can be beneficial for reading the data later, because (with some file types, parquet included) it can optimize to read data only from the partitions that you use (i.e. if you read and filter on centroid0 == 1, Spark won't read the other partitions).
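As a minimal sketch of that pruning, reusing the question's spark session and fpath (everything else here is illustrative):

    import org.apache.spark.sql.functions.col

    // fpath is the same output path used in the question's save call.
    val clusters = spark.read.parquet(fpath)
    // Filtering on the partition column lets Spark scan only the files
    // under centroid0=1 (partition pruning) instead of the whole dataset.
    clusters.filter(col("centroid0") === 1).show()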

The effect of this is that the partition fields (centroid0 in your case) are not written into the parquet files themselves, only as folder names (centroid0=1, centroid0=2, etc.).
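On disk, the resulting layout looks roughly like this (the part-file names are illustrative):

    .../level1/clusters/centroid0=1/part-00000-<uuid>.snappy.parquet
    .../level1/clusters/centroid0=2/part-00000-<uuid>.snappy.parquet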

One side effect of this is that the type of the partition column is inferred at run time (since its schema is not saved in the parquet files), and in your case you happened to have only integer-like values, so it was inferred as integer.
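If you want to prevent that inference entirely, Spark has a setting that disables partition-column type inference, so partition values come back as strings (a minimal sketch, again reusing the question's fpath):

    // Disable type inference for partition columns; partition values are
    // then read back as StringType instead of being guessed as integers.
    spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

    val df = spark.read.parquet(fpath)
    df.printSchema()  // centroid0: string (nullable = true) -- still at the end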

The other side effect is that the partition field is appended at the end of the schema: Spark reads the schema from the parquet files as one chunk, and then adds the partition field(s) to it separately (again, they are no longer part of the schema stored in the parquet).
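If both the original type and the original column order matter, one workaround (a sketch, with feature1 standing in for the question's other, unnamed columns) is to supply an explicit schema on read and then select the columns back into the desired order:

    import org.apache.spark.sql.types._

    // feature1 is a hypothetical stand-in for the dataset's other columns.
    val schema = StructType(Seq(
      StructField("centroid0", StringType),
      StructField("feature1", DoubleType)
    ))

    val restored = spark.read
      .schema(schema)                    // pins centroid0 to StringType
      .parquet(fpath)
      .select("centroid0", "feature1")   // restores the original column order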

