Spark: Programmatically creating dataframe schema in Scala


Problem description

I have a smallish dataset that will be the result of a Spark job. I am thinking about converting this dataset to a dataframe for convenience at the end of the job, but have struggled to correctly define the schema. The problem is the last field below (topValues); it is an ArrayBuffer of tuples -- keys and counts.

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types._

  val innerSchema =
    StructType(
      Array(
        StructField("value", StringType),
        StructField("count", LongType)
      )
    )
  val outputSchema =
    StructType(
      Array(
        StructField("name", StringType, nullable=false),
        StructField("index", IntegerType, nullable=false),
        StructField("count", LongType, nullable=false),
        StructField("empties", LongType, nullable=false),
        StructField("nulls", LongType, nullable=false),
        StructField("uniqueValues", LongType, nullable=false),
        StructField("mean", DoubleType),
        StructField("min", DoubleType),
        StructField("max", DoubleType),
        StructField("topValues", innerSchema)
      )
    )

  val result = stats.columnStats.map{ c =>
    Row(c._2.name, c._1, c._2.count, c._2.empties, c._2.nulls, c._2.uniqueValues, c._2.mean, c._2.min, c._2.max, c._2.topValues.topN)
  }

  val rdd = sc.parallelize(result.toSeq)

  val outputDf = sqlContext.createDataFrame(rdd, outputSchema)

  outputDf.show()

The error I'm getting is a MatchError: scala.MatchError: ArrayBuffer((10,2), (20,3), (8,1)) (of class scala.collection.mutable.ArrayBuffer)

When I debug and inspect my objects, I'm seeing this:

rdd: ParallelCollectionRDD[2]
rdd.data: "ArrayBuffer" size = 2
rdd.data(0): [age,2,6,0,0,3,14.666666666666666,8.0,20.0,ArrayBuffer((10,2), (20,3), (8,1))]
rdd.data(1): [gender,3,6,0,0,2,0.0,0.0,0.0,ArrayBuffer((M,4), (F,2))]

It seems to me that I've accurately described the ArrayBuffer of tuples in my innerSchema, but Spark disagrees.

Any idea how I should be defining the schema?

Recommended answer

The MatchError means the value passed to Row doesn't match the declared field type: an ArrayBuffer maps to Spark SQL's ArrayType, so the field must be declared as an array rather than a struct. A minimal example:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rdd = sc.parallelize(Array(Row(ArrayBuffer(1, 2, 3, 4))))
val df = sqlContext.createDataFrame(
  rdd,
  StructType(Seq(StructField("arr", ArrayType(IntegerType, false), false)))
)

df.printSchema
root
 |-- arr: array (nullable = false)
 |    |-- element: integer (containsNull = false)

df.show
+------------+
|         arr|
+------------+
|[1, 2, 3, 4]|
+------------+
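
Applied to the question's schema, the fix is to declare topValues as an ArrayType wrapping innerSchema, and to convert each (value, count) tuple to a Row so the elements match the struct. A sketch, reusing innerSchema, stats, sc, and sqlContext from the question; the v.toString and n.toLong conversions are assumptions made to match the StringType/LongType fields of innerSchema:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val outputSchema =
  StructType(
    Array(
      StructField("name", StringType, nullable = false),
      StructField("index", IntegerType, nullable = false),
      StructField("count", LongType, nullable = false),
      StructField("empties", LongType, nullable = false),
      StructField("nulls", LongType, nullable = false),
      StructField("uniqueValues", LongType, nullable = false),
      StructField("mean", DoubleType),
      StructField("min", DoubleType),
      StructField("max", DoubleType),
      StructField("topValues", ArrayType(innerSchema)) // array of structs, not a single struct
    )
  )

val result = stats.columnStats.map { c =>
  Row(c._2.name, c._1, c._2.count, c._2.empties, c._2.nulls, c._2.uniqueValues,
    c._2.mean, c._2.min, c._2.max,
    // each (value, count) tuple becomes a Row matching innerSchema
    c._2.topValues.topN.map { case (v, n) => Row(v.toString, n.toLong) })
}

val outputDf = sqlContext.createDataFrame(sc.parallelize(result.toSeq), outputSchema)
outputDf.show()

With topValues declared as an array, createDataFrame accepts the ArrayBuffer without the MatchError.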

