Prevent DataFrame.partitionBy() from removing partitioned columns from schema


Problem description

I am partitioning a DataFrame as follows:

df.write.partitionBy("type", "category").parquet(config.outpath)

The code gives the expected results (i.e. data partitioned by type & category). However, the "type" and "category" columns are removed from the data / schema. Is there a way to prevent this behaviour?
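A point worth noting, though it is not stated in the question itself: the columns are only removed from the data files, not lost. They are encoded in the directory names (`type=.../category=...`), and Spark's partition discovery restores them to the schema when the output is read back. A minimal self-contained sketch (the `/tmp/partition_demo` path and the sample data are illustrative, not from the question):

```scala
import org.apache.spark.sql.SparkSession

object PartitionReadBack {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("partition-read-back")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", "x", 1), ("b", "y", 2)).toDF("type", "category", "value")

    // The partition columns are dropped from the Parquet files themselves
    // and encoded in directory names like .../type=a/category=x/
    df.write.mode("overwrite")
      .partitionBy("type", "category")
      .parquet("/tmp/partition_demo")

    // On read, partition discovery parses those directory names and adds
    // "type" and "category" back to the schema.
    val restored = spark.read.parquet("/tmp/partition_demo")
    restored.printSchema() // value, type, category are all present

    spark.stop()
  }
}
```

So if the data is consumed via `spark.read.parquet`, the schema is effectively unchanged; the missing columns only matter when the Parquet files are read by a tool that does not perform Hive-style partition discovery.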

Recommended answer

I can think of one workaround, which is rather lame, but works.

import spark.implicits._

val duplicated = df.withColumn("_type", $"type").withColumn("_category", $"category")
duplicated.write.partitionBy("_type", "_category").parquet(config.outpath)
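With this layout, the data files keep the original "type" and "category" columns, while the prefixed copies live only in the directory names. A read then yields both sets of columns (the originals from the files, the prefixed ones from partition discovery), so the extras can simply be dropped. A sketch, assuming the `spark` session and `config.outpath` from the question:

```scala
// Hypothetical read-back of the workaround's output: "type"/"category"
// come from the data files, "_type"/"_category" from partition discovery.
val readBack = spark.read.parquet(config.outpath)
  .drop("_type", "_category") // keep only the original columns
```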

I'm answering this question because I have the same one, in the hope that someone will post a better answer or explanation than mine (or that the OP will, if a better solution has since been found).
