How to create a schema from a CSV file and persist/save that schema to a file?

Question

I have a CSV file with 10 columns. Half are Strings and half are Integers.

What is the Scala code to:

  • Create (infer) the schema
  • Save that schema to a file

I have this so far:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("cars.csv")

And what is the best file format for saving that schema? Is it JSON?

The goal: I want to create the schema only once and load it from a file the next time, instead of re-creating it on the fly.

Thanks.

Recommended answer

The DataType API provides all the required utilities, so JSON is a natural choice:

import org.apache.spark.sql.types._
import scala.util.Try

val df = Seq((1L, "foo", 3.0)).toDF("id", "x1", "x2")
val serializedSchema: String = df.schema.json


def loadSchema(s: String): Option[StructType] =
  Try(DataType.fromJson(s)).toOption.flatMap {
    case s: StructType => Some(s)
    case _ => None 
  }

loadSchema(serializedSchema)
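
To tie this back to the CSV in the question, here is a minimal sketch of reusing a parsed schema instead of inferring it again. It assumes the `sqlContext`, the `com.databricks.spark.csv` format name, and the `header` option from the question. `parseSchema` and `readCsvWithSchema` are hypothetical helper names, not Spark APIs.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.types.{DataType, StructType}

// Hypothetical helper: turn a schema JSON string back into a StructType.
// Throws if the JSON is invalid or describes a non-struct type.
def parseSchema(json: String): StructType =
  DataType.fromJson(json) match {
    case st: StructType => st
    case other =>
      throw new IllegalArgumentException(
        s"Expected a struct schema, got ${other.typeName}")
  }

// Hypothetical helper: read a CSV with an explicit schema, so that
// inferSchema (and the extra pass over the data it needs) is skipped.
def readCsvWithSchema(
    sqlContext: SQLContext,
    path: String,
    schemaJson: String): DataFrame =
  sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .schema(parseSchema(schemaJson))
    .load(path)
```

With the values above, a call such as `readCsvWithSchema(sqlContext, "cars.csv", serializedSchema)` would apply the stored schema, provided the JSON was produced from a DataFrame with the same columns as the file.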

Depending on your requirements, you can use standard Scala methods to write this to a file, or hack it with a Spark RDD:

val schemaPath: String = ???

sc.parallelize(Seq(serializedSchema), 1).saveAsTextFile(schemaPath)
val loadedSchema: Option[StructType] = sc.textFile(schemaPath)
  .map(loadSchema)  // Load
  .collect.headOption.flatten  // Make sure we don't fail if there is no data
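
For the "standard Scala methods" route, one possible sketch uses `java.nio.file` on the driver's local file system. The helper names `writeSchemaJson` and `readSchemaJson` are made up for illustration. This assumes the schema file lives on a local path rather than on HDFS or another distributed store.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.util.Try

// Hypothetical helper: write the schema JSON string to a local file,
// creating the file or overwriting an existing one.
def writeSchemaJson(path: String, json: String): Unit = {
  Files.write(Paths.get(path), json.getBytes(StandardCharsets.UTF_8))
  ()
}

// Hypothetical helper: read the JSON back; None if the file cannot be read.
def readSchemaJson(path: String): Option[String] =
  Try(new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8)).toOption
```

Combined with the answer's code, `writeSchemaJson(schemaPath, serializedSchema)` would save the schema once, and `readSchemaJson(schemaPath).flatMap(loadSchema)` would give back an `Option[StructType]` on later runs.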

For a Python equivalent, see Config file to define JSON Schema Struture in PySpark
