Prevent DataFrame.partitionBy() from removing partitioned columns from schema


Question

I am partitioning a DataFrame as follows:

df.write.partitionBy("type", "category").parquet(config.outpath)

The code gives the expected results (i.e. data partitioned by type & category). However, the "type" and "category" columns are removed from the data / schema. Is there a way to prevent this behaviour?

Answer

I can think of one workaround, which is rather lame, but works.

import spark.implicits._

val duplicated = df.withColumn("_type", $"type").withColumn("_category", $"category")
duplicated.write.partitionBy("_type", "_category").parquet(config.outpath)
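As a sketch of why this works: `partitionBy` encodes the partition columns into the directory layout (e.g. `_type=a/_category=x/`) and drops them from the data files, but the duplicated `type`/`category` columns are written into the Parquet files themselves. On read, Spark's partition discovery reconstructs `_type`/`_category` from the paths, so all columns are available. The following self-contained example illustrates this; `outPath` and the sample data are hypothetical stand-ins (the original uses `config.outpath`):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("partitionBy-demo")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  ("a", "x", 1),
  ("b", "y", 2)
).toDF("type", "category", "value")

val outPath = "/tmp/partitionby-demo" // stand-in for config.outpath

// Duplicate the partition columns so the originals survive in the data files.
val duplicated = df
  .withColumn("_type", $"type")
  .withColumn("_category", $"category")

duplicated.write.mode("overwrite").partitionBy("_type", "_category").parquet(outPath)

// Partition discovery restores _type/_category from the directory names,
// while type/category were stored in the Parquet files themselves.
val restored = spark.read.parquet(outPath)
restored.printSchema() // type, category, value, _type, _category
```

The cost of this workaround is that each partition value is stored twice: once in the directory name and once in every row of the data files.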

I'm answering this question since I have the same problem, in the hope that someone will post a better answer or explanation than mine (or that the OP has found a better solution).

