Spark Dataframe schema definition using reflection with case classes and column name aliases


Problem Description

I ran into a little problem with my Spark Scala script. Basically, I have raw data on which I am doing aggregations, and after grouping, counting, etc. I want to save the output in a specific JSON format.

I tried to simplify the question and rewrote it:

When I select data from the source DataFrame with an Array[org.apache.spark.sql.Column] where the column names have aliases, and then use a column name (or index) held in a variable when mapping the rows to a case class, I get a "Task not serializable" exception.

    // Spark 1.x spark-shell session; outside the shell you would also need
    // import org.apache.spark.sql.functions.col and import sqlContext.implicits._
    var dm = sqlContext.createDataFrame(Seq((1,"James"),(2,"Anna"))).toDF("id", "name")

    val cl = dm.columns
    val cl2 = cl.map(name => col(name).as(name.capitalize)) // alias: "name" -> "Name"
    val dm2 = dm.select(cl2:_*)
    val n = "Name"
    case class Result(Name:String)
    val r = dm2.map(row => Result(row.getAs(n))).toDF // throws "Task not serializable"

And for the second part of the question, I actually need the final schema to be an array of these Result class objects. I still haven't figured out how to do that either. The expected result should have a schema like this:

    case class Test(var FilteredStatistics: Array[Result])
    val t = Test(Array(Result("Anna"), Result("James")))

    val t2 = sc.parallelize(Seq(t)).toDF

    scala> t2.printSchema
    root
     |-- FilteredStatistics: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- Name: string (nullable = true)

TL;DR:

  1. How do I map DataFrame rows to a case class object when the DataFrame columns have aliases and variables are used for the column names?

  2. How do I add these case class objects to an array?

Recommended Answer

Serialization issue: the problem here is val n = "Name": it is used inside an anonymous function passed to an RDD transformation (dm2.map(...)), which makes Spark close over that variable and its containing scope. That scope also includes cl2, which has the type Array[Column] and therefore isn't serializable.

The solution is simple: either inline n (to get dm2.map(row => Result(row.getAs("Name")))), or place it in a serializable context (an object or a class that doesn't contain any non-serializable members).
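
A minimal sketch of the inlined fix, reusing dm2 and the Result case class from the question (same spark-shell session assumed):

    // Inlining the literal keeps the closure from capturing the enclosing
    // scope, which holds the non-serializable Array[Column] cl2
    val r = dm2.map(row => Result(row.getAs[String]("Name"))).toDF

For the second part of the question, one possible approach (a sketch, staying with the RDD-style API the question already uses) is to collect the mapped Result objects into a local array and wrap it in the Test case class, reproducing the expected array-of-struct schema:

    // Test as defined in the question
    case class Test(var FilteredStatistics: Array[Result])

    // Collect the Results to the driver (fine for small aggregated output),
    // then wrap the array in Test to get the FilteredStatistics schema
    val results = dm2.map(row => Result(row.getAs[String]("Name"))).collect()
    val t2 = sc.parallelize(Seq(Test(results))).toDF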
