How to create a Schema file in Spark

Views: 109 Published: 2020/9/4 18:52:55 Tags: scala apache-spark-sql schema orc

Question

I am trying to read a schema file (a plain text file) and apply it to a CSV file that has no header. Since I already have the schema file, I don't want to use the inferSchema option, which adds overhead.

My input schema file looks like this:

"num IntegerType","letter StringType"

I am trying the code below to create the schema from that file:

val schema_file = spark.read.textFile("D:\\Users\\Documents\\schemaFile.txt")
val struct_type = schema_file.flatMap(x => x.split(",")).map(b => (b.split(" ")(0).stripPrefix("\"").asInstanceOf[String],b.split(" ")(1).stripSuffix("\"").asInstanceOf[org.apache.spark.sql.types.DataType])).foreach(x=>println(x))

I get the following error:

Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.types.DataType

- field (class: "org.apache.spark.sql.types.DataType", name: "_2")
- root class: "scala.Tuple2"

I then try to use the result as the schema when reading the CSV with spark.read.csv, as shown below, and write the output as an ORC file:

  val df=spark.read
      .format("org.apache.spark.csv")
      .option("header", false)
      .option("inferSchema", true)
      .option("samplingRatio",0.01)
      .option("nullValue", "NULL")
      .option("delimiter","|")
      .schema(schema_file)
      .csv("D:\\Users\\sampleFile.txt")
      .toDF().write.format("orc").save("D:\\Users\\ORC")

I need help turning the text file into a schema and converting my input CSV file to ORC.

Recommended answer

To create a schema from a text file, write a function that matches each type name and returns the corresponding DataType:

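import org.apache.spark.sql.types._

// Map a type name from the schema file to the corresponding Spark DataType.
// Any name not listed here falls back to StringType.
def getType(raw: String): DataType = {
  raw match {
    case "ByteType" => ByteType
    case "ShortType" => ShortType
    case "IntegerType" => IntegerType
    case "LongType" => LongType
    case "FloatType" => FloatType
    case "DoubleType" => DoubleType
    case "BooleanType" => BooleanType
    case "TimestampType" => TimestampType
    case _ => StringType
  }
}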

Now build the schema by reading the schema file:

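import scala.io.Source
import org.apache.spark.sql.types.StructField

// Read the file on the driver, split it into "name type" entries,
// strip the quotes, and build one nullable StructField per entry.
val schema = Source.fromFile("schema.txt").getLines().toList
  .flatMap(_.split(",")).map(_.replaceAll("\"", "").split(" "))
  .map(x => StructField(x(0), getType(x(1)), true))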
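
Note that the snippet above works on ordinary Scala collections on the driver, so Spark never needs an Encoder for DataType, which is what the Dataset-based attempt in the question ran into. The parsing step on its own can be sketched in plain Scala as below; parseSchemaLine is a hypothetical helper name, and the input is the sample line from the question. It returns type names as strings, leaving the conversion to a DataType to a function such as getType.

```scala
// Hypothetical helper (not part of Spark): split one schema line such as
//   "num IntegerType","letter StringType"
// into (columnName, typeName) pairs.
def parseSchemaLine(line: String): Seq[(String, String)] =
  line.split(",").toSeq.map { entry =>
    // Drop the surrounding quotes, then split on whitespace.
    val parts = entry.trim.stripPrefix("\"").stripSuffix("\"").trim.split("\\s+")
    (parts(0), parts(1))
  }

// The sample schema line from the question.
val pairs = parseSchemaLine("\"num IntegerType\",\"letter StringType\"")
pairs.foreach { case (name, typeName) => println(s"$name -> $typeName") }
```

A type name in the file is just text, so it has to be looked up (as getType does) rather than cast with asInstanceOf, which cannot turn a String into a DataType.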

Now read the CSV file:

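import org.apache.spark.sql.types.StructType

// The schema is supplied explicitly, so inferSchema is not needed;
// samplingRatio only affects schema inference and is dropped here.
spark.read
  .option("delimiter", "|")
  .option("nullValue", "NULL")
  .schema(StructType(schema))
  .csv("data.csv")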
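
The question also asks for ORC output, which the read above stops short of. A minimal sketch of the full CSV-to-ORC step follows, written in script or spark-shell style. The function name csvToOrc, the value sampleSchema, and the paths passed in are assumptions for illustration; the schema is hard-coded to match the sample schema file, where in practice it would come from the parsing step. Depending on the Spark version, writing ORC may require Hive support to be enabled.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumed schema, equivalent to: "num IntegerType","letter StringType"
val sampleSchema = StructType(Seq(
  StructField("num", IntegerType, nullable = true),
  StructField("letter", StringType, nullable = true)
))

// Hypothetical helper: read a headerless, pipe-delimited CSV with an
// explicit schema and write it back out as ORC.
def csvToOrc(spark: SparkSession, schema: StructType,
             inputPath: String, outputPath: String): Unit = {
  spark.read
    .option("header", "false")   // the input file has no header row
    .option("delimiter", "|")
    .option("nullValue", "NULL") // treat the literal NULL as a null value
    .schema(schema)              // explicit schema, so no inference pass
    .csv(inputPath)
    .write
    .format("orc")
    .save(outputPath)
}
```

It could be called as csvToOrc(spark, sampleSchema, "data.csv", "orc_output"), where both paths are placeholders. By default save fails if the output directory already exists; adding .mode("overwrite") before .format("orc") would change that.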

Hope this helps!
