Spark-SQL: How to read a TSV or CSV file into a dataframe and apply a custom schema?


Question

I'm using Spark 2.0 while working with tab-separated value (TSV) and comma-separated value (CSV) files. I want to load the data into Spark-SQL dataframes, where I would like to control the schema completely when the files are read. I don't want Spark to guess the schema from the data in the file.

How would I load TSV or CSV files into Spark SQL Dataframes and apply a schema to them?

Answer

Below is a complete Spark 2.0 example of loading a tab-separated value (TSV) file and applying a schema.

I'm using the Iris data set in TSV format from UAH.edu as an example. Here are the first few rows from that file:

Type    PW      PL      SW      SL
0       2       14      33      50
1       24      56      31      67
1       23      51      31      69
0       2       10      36      46
1       20      52      30      65

To enforce a schema, you can programmatically build it using one of two methods:

A. Create the schema with StructType:

import org.apache.spark.sql.types._

val irisSchema = StructType(Array(
    StructField("Type",         IntegerType, true),
    StructField("PetalWidth",   IntegerType, true),
    StructField("PetalLength",  IntegerType, true),
    StructField("SepalWidth",   IntegerType, true),
    StructField("SepalLength",  IntegerType, true)
    ))

B. Alternatively, create the schema with a case class and Encoders (this approach is less verbose):

import org.apache.spark.sql.Encoders

case class IrisSchema(Type: Int, PetalWidth: Int, PetalLength: Int, 
                      SepalWidth: Int, SepalLength: Int)

val irisSchema = Encoders.product[IrisSchema].schema

Once you have created your schema, you can use spark.read to read in the TSV file. Note that you can also read comma-separated value (CSV) files, or any other delimited files, as long as you set option("delimiter", d) correctly. Further, if your data file has a header line, be sure to set option("header", "true").
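For instance, reading a comma-separated version of the same data would change only the delimiter. This is a sketch, not part of the original answer: iris.csv is a hypothetical file name, and it assumes spark and irisSchema are defined as in the complete code below.

```scala
// Hypothetical CSV variant of the same read: only the delimiter differs.
val irisCsvDf = spark.read.format("csv")
  .option("header", "true")
  .option("delimiter", ",")    // comma instead of tab
  .schema(irisSchema)          // same schema as the TSV case
  .load("iris.csv")            // assumed file name
```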

Below is the complete final code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Encoders

val spark = SparkSession.builder().getOrCreate()

case class IrisSchema(Type: Int, PetalWidth: Int, PetalLength: Int,
                      SepalWidth: Int, SepalLength: Int)

val irisSchema = Encoders.product[IrisSchema].schema

val irisDf = spark.read.format("csv").     // Use "csv" regardless of TSV or CSV.
                option("header", "true").  // Does the file have a header line?
                option("delimiter", "\t"). // Set delimiter to tab or comma.
                schema(irisSchema).        // Schema that was built above.
                load("iris.tsv")

irisDf.show(5)

Here is the output:

scala> irisDf.show(5)
+----+----------+-----------+----------+-----------+
|Type|PetalWidth|PetalLength|SepalWidth|SepalLength|
+----+----------+-----------+----------+-----------+
|   0|         2|         14|        33|         50|
|   1|        24|         56|        31|         67|
|   1|        23|         51|        31|         69|
|   0|         2|         10|        36|         46|
|   1|        20|         52|        30|         65|
+----+----------+-----------+----------+-----------+
only showing top 5 rows

