spark convert dataframe to dataset using case class with option fields
Question
I have the following case class:
case class Person(name: String, lastname: Option[String] = None, age: BigInt) {}
and the following JSON:
{ "name": "bemjamin", "age" : 1 }
When I try to transform my dataframe into a dataset:
spark.read.json("example.json")
.as[Person].show()
it shows the following error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'lastname' given input columns: [age, name];
My question is: if my schema is my case class, and it defines that the lastname is optional, shouldn't as() do the conversion?
I can easily fix this using a .map, but I would like to know if there is a cleaner alternative.
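For reference, the .map workaround mentioned above might look like the sketch below. It is only an illustration, not the asker's actual code: it assumes a SparkSession named spark, reads the file with inferred schema, and fills in lastname as None when the column was not inferred at all.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._  // provides the Encoder[Person] needed by .map

// Read as a generic DataFrame, then build Person values by hand,
// treating the missing "lastname" column as None.
val ds = spark.read.json("example.json").map { row =>
  val lastname =
    if (row.schema.fieldNames.contains("lastname"))
      Option(row.getAs[String]("lastname"))
    else
      None
  Person(row.getAs[String]("name"), lastname, BigInt(row.getAs[Long]("age")))
}
```

This works, but the per-row field handling is exactly the boilerplate the question hopes to avoid, which motivates the schema-based answer below.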
Answer
There is another option to solve this issue. Two steps are required:
1. Make sure that fields that can be missing are declared as nullable Scala types (like Option[_]).
2. Provide a schema argument rather than relying on schema inference. For example, you can use a Spark SQL Encoder:
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Person].schema
You can update the code as follows.
val schema = Encoders.product[Person].schema
val df = spark.read
  .schema(schema)
  .json("/Users/../Desktop/example.json")
  .as[Person]
df.show()
+--------+--------+---+
| name|lastname|age|
+--------+--------+---+
|bemjamin| null| 1|
+--------+--------+---+