Spark - creating schema programmatically with different data types


Question

I have a dataset consisting of 7-8 fields of type String, Int, and Float.

I am trying to create the schema programmatically like this:

val schema = StructType(header.split(",").map(column => StructField(column, StringType, true)))

Then I map the data to the Row type like this:

val dataRdd = datafile
  .filter(x => x != header)
  .map(x => x.split(","))
  .map(col => Row(col(0).trim, col(1).toInt, col(2).toFloat, col(3), col(4), col(5), col(6), col(7), col(8)))

But after creating the DataFrame, calling DF.show() raises an error for the Integer field.

So how can I create such a schema when the dataset contains multiple data types?

Answer

The problem in your code is that you assign StringType to all the fields.

Assuming the header contains only the field names, the types cannot be guessed from it.

Let's assume that the header string looks like this:

val header = "field1:Int,field2:Double,field3:String"

Then the code should be:

import org.apache.spark.sql.types._

def inferType(field: String) = field.split(":")(1) match {
  case "Int" => IntegerType
  case "Double" => DoubleType
  case "String" => StringType
  case _ => StringType
}

val schema = StructType(header.split(",").map(column => StructField(column, inferType(column), true)))

For the example header string, the resulting schema prints as:

root
 |-- field1:Int: integer (nullable = true)
 |-- field2:Double: double (nullable = true)
 |-- field3:String: string (nullable = true)
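
Note that the field names still carry the ":Type" suffix (e.g. field1:Int), because each full header token was used as the name. Below is a minimal sketch, not part of the original answer, that strips the suffix and also converts the raw string values to the declared types before building Rows; fieldName and toTypedValue are hypothetical helpers, and datafile is assumed to be the RDD[String] from the question:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Hypothetical helper: keep only the name part of "field1:Int".
def fieldName(field: String) = field.split(":")(0)

val cleanSchema = StructType(header.split(",").map(column =>
  StructField(fieldName(column), inferType(column), true)))

// Hypothetical helper: convert a raw string to the declared type.
def toTypedValue(raw: String, dataType: DataType): Any = dataType match {
  case IntegerType => raw.trim.toInt
  case DoubleType  => raw.trim.toDouble
  case _           => raw
}

val types = header.split(",").map(inferType)

val typedRdd = datafile
  .filter(_ != header)
  .map(_.split(","))
  .map(cols => Row(cols.zip(types).map { case (v, t) => toTypedValue(v, t) }: _*))

val df = spark.createDataFrame(typedRdd, cleanSchema)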

On the other hand, if what you need is a DataFrame from a text file, I would suggest creating the DataFrame directly from the file itself. It's pointless to create it from an RDD.

val fileReader = spark.read.format("com.databricks.spark.csv")
  .option("mode", "DROPMALFORMED")
  .option("header", "true")
  .option("inferschema", "true")
  .option("delimiter", ",")

val df = fileReader.load(PATH_TO_FILE)
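
As a side note, on Spark 2.0+ the CSV source is built in, so the external com.databricks.spark.csv package is not needed. A roughly equivalent sketch:

// Spark 2.0+: CSV support is built into spark.read, no external package needed.
val df = spark.read
  .option("mode", "DROPMALFORMED")  // drop rows that fail to parse
  .option("header", "true")         // first line contains the column names
  .option("inferSchema", "true")    // let Spark infer each column's type
  .option("delimiter", ",")
  .csv(PATH_TO_FILE)                // PATH_TO_FILE as in the snippet above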
