Spark - creating schema programmatically with different data types

Problem description

I have a dataset consisting of 7-8 fields of type String, Int, and Float.

I am trying to create the schema programmatically, using this:

val schema = StructType(header.split(",").map(column => StructField(column, StringType, true)))

Then I map each record to a Row like this:

val dataRdd = datafile.filter(x => x!=header).map(x => x.split(",")).map(col => Row(col(0).trim, col(1).toInt, col(2).toFloat, col(3), col(4) ,col(5), col(6), col(7), col(8)))

But after creating the DataFrame, calling DF.show() gives an error for the Integer field.
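
The error comes from the mismatch between the all-StringType schema and the Int/Float values placed in each Row; Spark only validates the rows when an action such as show() runs. Here is a minimal sketch of the same mismatch, with a hypothetical column name and value:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// The schema declares the column as a string...
val stringSchema = StructType(Seq(StructField("age", StringType, true)))
// ...but the Row carries an Int, so the first action fails with an error
// such as "java.lang.Integer is not a valid external type for schema of string".
val bad = spark.createDataFrame(spark.sparkContext.parallelize(Seq(Row(30))), stringSchema)
bad.show()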

So how do I create such a schema when the dataset has multiple data types?

Recommended answer

The problem in your code is that you are assigning StringType to all of the fields.

Assuming that the header contains only the names of the fields, you can't guess the types from it.

Let's assume that the header string is like this:

val header = "field1:Int,field2:Double,field3:String"

Then the code would be:

import org.apache.spark.sql.types._

// Map the ":Type" suffix of each header entry to a Spark SQL DataType,
// falling back to StringType for anything unrecognized.
def inferType(field: String) = field.split(":")(1) match {
   case "Int" => IntegerType
   case "Double" => DoubleType
   case "String" => StringType
   case _ => StringType
}

val schema = StructType(header.split(",").map(column => StructField(column, inferType(column), true)))

For the example header string you get:

root
 |-- field1:Int: integer (nullable = true)
 |-- field2:Double: double (nullable = true)
 |-- field3:String: string (nullable = true)
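
Note that with this approach the field names keep their ":Type" suffix, as the printed schema shows. A possible refinement (not from the original answer, and using the same datafile, header, and inferType from above) is to split the name off as well, and to parse each cell according to its declared type, which also fixes the Row/schema mismatch from the question:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Keep only the part before ":" as the column name; inferType (above)
// still reads the type from the part after ":".
val cleanSchema = StructType(header.split(",").map { column =>
  StructField(column.split(":")(0), inferType(column), nullable = true)
})

// Convert each raw string cell to the type its StructField declares,
// so the Row contents match the schema.
def parseCell(raw: String, dataType: DataType): Any = dataType match {
  case IntegerType => raw.trim.toInt
  case DoubleType  => raw.trim.toDouble
  case _           => raw
}

val rowRdd: RDD[Row] = datafile
  .filter(_ != header)
  .map(_.split(","))
  .map { cells =>
    Row.fromSeq(cells.zip(cleanSchema.fields).map {
      case (cell, field) => parseCell(cell, field.dataType)
    })
  }

val typedDf = spark.createDataFrame(rowRdd, cleanSchema)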

On the other hand, if what you need is a DataFrame from a text file, I would suggest creating the DataFrame directly from the file itself. It's pointless to create it from an RDD.

// In Spark 2.x+ the built-in format("csv") can be used instead of the
// external com.databricks.spark.csv package.
val fileReader = spark.read.format("com.databricks.spark.csv")
  .option("mode", "DROPMALFORMED")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ",")

val df = fileReader.load(PATH_TO_FILE)
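
With inferSchema enabled, Spark makes an extra pass over the file to work out the column types, so a quick check such as the following should show integer/double/string columns rather than all strings:

df.printSchema()  // column types inferred from the data, not all StringType
df.show(5)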
