Spark 2.0 implicit encoder: dealing with a missing column when the type is Option[Seq[String]] (Scala)


Problem description

I'm having trouble encoding data when columns of type Option[Seq[String]] are missing from our data source. Ideally, the missing column data would be filled with None.

Scenario:

Some of the Parquet files we read have column1 but not column2.

We load the data from these Parquet files into a Dataset and cast it to MyType.

case class MyType(column1: Option[String], column2: Option[Seq[String]])

sqlContext.read.parquet("dataSource.parquet").as[MyType]

org.apache.spark.sql.AnalysisException: cannot resolve 'column2' given input columns: [column1];

Is there a way to create the Dataset with the column2 data filled in as None?

Recommended answer

In simple cases you can provide an initial schema that is a superset of the expected schemas. For example, in your case:

val schema = Seq[MyType]().toDF.schema

Seq("a", "b", "c").map(Option(_))
  .toDF("column1")
  .write.parquet("/tmp/column1only")

val df = spark.read.schema(schema).parquet("/tmp/column1only").as[MyType]
df.show

+-------+-------+
|column1|column2|
+-------+-------+
|      a|   null|
|      b|   null|
|      c|   null|
+-------+-------+

df.first

MyType = MyType(Some(a),None)
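
The superset schema can also be taken straight from the case class encoder instead of from an empty Seq. The following is a minimal sketch, assuming a local SparkSession and the /tmp/column1only path used above; the names expectedSchema, typed and rows are illustrative only.

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

case class MyType(column1: Option[String], column2: Option[Seq[String]])

// Assumption: a local session; in spark-shell getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("superset-schema").getOrCreate()
import spark.implicits._

// Derive the expected schema from the encoder of the target case class.
val expectedSchema = Encoders.product[MyType].schema
expectedSchema.printTreeString()

// Write a file that only contains column1, as in the example above.
Seq("a", "b", "c").toDF("column1")
  .write.mode("overwrite").parquet("/tmp/column1only")

// Reading with the superset schema yields null for the absent column2,
// which the encoder maps to None.
val typed = spark.read.schema(expectedSchema).parquet("/tmp/column1only").as[MyType]
val rows = typed.collect()
```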

This approach can be a little fragile, so in general you should use SQL literals to fill in the blanks instead:

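import org.apache.spark.sql.functions.lit

spark.read.parquet("/tmp/column1only")
  // add the missing column as a typed null; the cast target can also be ArrayType(StringType)
  .withColumn("column2", lit(null).cast("array<string>"))
  .as[MyType]
  .first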

MyType = MyType(Some(a),None)
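
When several columns may be absent, the same literal-based idea can be applied to every field of the target type. The sketch below is only an illustration: withMissingColumns is a hypothetical helper, not part of Spark, and it assumes a local SparkSession, handles top-level columns only, and compares column names case-sensitively.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoder, SparkSession}
import org.apache.spark.sql.functions.lit

case class MyType(column1: Option[String], column2: Option[Seq[String]])

// Assumption: a local session; in spark-shell getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("fill-missing-columns").getOrCreate()
import spark.implicits._

// Hypothetical helper: for each top-level field of T that df lacks,
// add a null literal cast to the field's type, then convert to Dataset[T].
def withMissingColumns[T: Encoder](df: DataFrame): Dataset[T] = {
  val targetSchema = implicitly[Encoder[T]].schema
  val filled = targetSchema.fields.foldLeft(df) { (acc, field) =>
    if (acc.columns.contains(field.name)) acc
    else acc.withColumn(field.name, lit(null).cast(field.dataType))
  }
  filled.as[T]
}

// Usage: a frame with column1 only, standing in for the Parquet source.
val raw = Seq("a", "b", "c").toDF("column1")
val ds = withMissingColumns[MyType](raw)
val firstRow = ds.first()
```

Because the cast target comes from the encoder's schema, the helper does not need a hand-written type string such as "array<string>" for each missing column.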
