DataFrame partitionBy on nested columns

Problem Description

I am trying to call partitionBy on a nested field like below:

val rawJson = sqlContext.read.json(filename)
rawJson.write.partitionBy("data.dataDetails.name").parquet(filenameParquet)

I get the error below when I run it. I do see 'name' listed as a field in the schema below (see the printSchema sketch after the JSON file). Is there a different format for specifying a column name that is nested?

java.lang.RuntimeException: Partition column data.dataDetails.name not found in schema StructType(StructField(name,StringType,true), StructField(time,StringType,true), StructField(data,StructType(StructField(dataDetails,StructType(StructField(name,StringType,true), StructField(id,StringType,true),true)),true))

This is my json file:

{
  "name": "AssetName",
  "time": "2016-06-20T11:57:19.4941368-04:00",
  "data": {
    "type": "EventData",
    "dataDetails": {
      "name": "EventName",
      "id": "1234"
    }
  }
}
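
For reference, printing the inferred schema confirms that the nested field exists (a minimal sketch, assuming the same sqlContext and filename as in the snippet above; the commented output is reconstructed from the schema in the error message):

val rawJson = sqlContext.read.json(filename)
rawJson.printSchema()
// root
//  |-- data: struct (nullable = true)
//  |    |-- dataDetails: struct (nullable = true)
//  |    |    |-- id: string (nullable = true)
//  |    |    |-- name: string (nullable = true)
//  |    |-- type: string (nullable = true)
//  |-- name: string (nullable = true)
//  |-- time: string (nullable = true)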

Recommended Answer

This appears to be a known issue listed here: https://issues.apache.org/jira/browse/SPARK-18084

I had this issue as well, and to work around it I un-nested the columns in my dataset. My dataset was a little different from yours, but here is the strategy...

Original JSON:

{
  "name": "AssetName",
  "time": "2016-06-20T11:57:19.4941368-04:00",
  "data": {
    "type": "EventData",
    "dataDetails": {
      "name": "EventName",
      "id": "1234"
    }
  }
}

Modified JSON:

{
  "name": "AssetName",
  "time": "2016-06-20T11:57:19.4941368-04:00",
  "data_type": "EventData",
  "data_dataDetails_name": "EventName",
  "data_dataDetails_id": "1234"
}

Code to get to the modified JSON:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StructField, StructType}

def main(args: Array[String]) {
  ...

  // Replace the "data" struct with one top-level column per child field,
  // keeping the existing top-level "name" and "time" columns.
  val data = df.select(children("data", df) ++ Seq(col("name"), col("time")): _*)

  data.printSchema

  // The flattened column is now top-level, so partitionBy can resolve it.
  data.write.partitionBy("data_dataDetails_name").format("csv").save(...)
}

// For a struct column `colname`, returns one column per child field,
// aliased with underscores, e.g. data.type becomes data_type.
def children(colname: String, df: DataFrame) = {
  val parent = df.schema.fields.filter(_.name == colname).head
  val fields = parent.dataType match {
    case x: StructType => x.fields
    case _ => Array.empty[StructField]
  }
  fields.map(x => col(s"$colname.${x.name}").alias(s"${colname}_${x.name}"))
}
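
Note that children flattens only one level: the first pass turns data.dataDetails into a single struct column named data_dataDetails. A second pass over that column (a hedged sketch, not part of the original answer) yields the data_dataDetails_name column used by partitionBy:

// Hypothetical second pass: flatten the remaining data_dataDetails struct
// so that data_dataDetails_name becomes a plain top-level column.
val flat = data.select(
  children("data_dataDetails", data) ++
    data.columns.filterNot(_ == "data_dataDetails").map(col): _*
)
flat.write.partitionBy("data_dataDetails_name").parquet(filenameParquet)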
