How to create a Schema file in Spark
Question
I am trying to read a schema file (a plain text file) and apply it to my CSV file, which has no header. Since I already have a schema file, I don't want to use the inferSchema option, which adds overhead.
My input schema file looks like this:
"num IntegerType","letter StringType"
I am trying the code below to build the schema:
val schema_file = spark.read.textFile("D:\\Users\\Documents\\schemaFile.txt")
val struct_type = schema_file
  .flatMap(x => x.split(","))
  .map(b => (b.split(" ")(0).stripPrefix("\"").asInstanceOf[String],
             b.split(" ")(1).stripSuffix("\"").asInstanceOf[org.apache.spark.sql.types.DataType]))
  .foreach(x => println(x))
and I am getting the error below:
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.types.DataType
- field (class: "org.apache.spark.sql.types.DataType", name: "_2") - root class: "scala.Tuple2"
I am also trying to use this as a schema while calling spark.read.csv as below, and then write the result out as an ORC file:
val df=spark.read
.format("org.apache.spark.csv")
.option("header", false)
.option("inferSchema", true)
.option("samplingRatio",0.01)
.option("nullValue", "NULL")
.option("delimiter","|")
.schema(schema_file)
.csv("D:\\Users\\sampleFile.txt")
.toDF().write.format("orc").save("D:\\Users\\ORC")
I need help converting the text file into a schema and converting my input CSV file to ORC.
Answer
To create a schema from a text file, first create a function that matches on the type name and returns the corresponding DataType:
import org.apache.spark.sql.types._

def getType(raw: String): DataType = {
  raw match {
    case "ByteType"      => ByteType
    case "ShortType"     => ShortType
    case "IntegerType"   => IntegerType
    case "LongType"      => LongType
    case "FloatType"     => FloatType
    case "DoubleType"    => DoubleType
    case "BooleanType"   => BooleanType
    case "TimestampType" => TimestampType
    case _               => StringType
  }
}
Now build the schema by reading the schema file:
import scala.io.Source
import org.apache.spark.sql.types.{StructField, StructType}

val schema = Source.fromFile("schema.txt").getLines().toList
  .flatMap(_.split(","))
  .map(_.replaceAll("\"", "").split(" "))
  .map(x => StructField(x(0), getType(x(1)), nullable = true))
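As a quick sanity check, the string handling above (split on commas, strip the quotes, split each pair on the space) can be exercised without Spark at all. `SchemaParseSketch` here is a hypothetical helper for illustration, not part of the answer's code:

```scala
// Hypothetical standalone sketch of the parsing step, no Spark required:
// split the schema line on commas, strip the quotes, then split each
// "name Type" pair on the space into a (name, typeName) tuple.
object SchemaParseSketch {
  def parse(line: String): List[(String, String)] =
    line.split(",").toList
      .map(_.replaceAll("\"", "").trim.split(" "))
      .map(arr => (arr(0), arr(1)))

  def main(args: Array[String]): Unit = {
    val fields = parse("\"num IntegerType\",\"letter StringType\"")
    println(fields) // List((num,IntegerType), (letter,StringType))
  }
}
```

Each resulting tuple is exactly what the answer feeds into `StructField(name, getType(typeName), nullable = true)`.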
Now read the CSV file with that schema:
spark.read
.option("samplingRatio", "0.01")
.option("delimiter", "|")
.option("nullValue", "NULL")
.schema(StructType(schema))
.csv("data.csv")
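The question also asks to write the result out as ORC. Assuming an active SparkSession named `spark`, the `schema` built above, and the paths from the question, the read can be chained into a write; this is a sketch, not tested against a running cluster:

```scala
// Sketch only: assumes an active SparkSession `spark`, the `schema` list
// built above, and the file paths from the question.
val df = spark.read
  .option("header", "false")
  .option("samplingRatio", "0.01")
  .option("delimiter", "|")
  .option("nullValue", "NULL")
  .schema(StructType(schema))
  .csv("D:\\Users\\sampleFile.txt")

// With an explicit schema there is no need for inferSchema.
df.write.format("orc").save("D:\\Users\\ORC")
```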
Hope this helps!