How to add a schema to a Dataset in Spark?

Question

I am trying to load a file into Spark. If I load a normal text file with textFile, like below:

val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")

The outcome is:

partFile: org.apache.spark.sql.Dataset[String] = [value: string]

I can see a Dataset in the output. But if I load a JSON file:

val pfile = spark.read.json("hdfs://quickstart:8020/user/cloudera/pjson")

The outcome is a DataFrame with a ready-made schema:

pfile: org.apache.spark.sql.DataFrame = [address: struct<city: string, state: string>, age: bigint ... 1 more field]

JSON/Parquet/ORC files come with a schema, so I understand this is a Spark 2.x feature that makes things easier: in those cases you directly get a DataFrame, while for a normal text file you get a Dataset without a meaningful schema. What I'd like to know is how to add a schema to the Dataset that results from loading a text file into Spark. For an RDD there is the case class/StructType option to add a schema and convert it to a DataFrame. Could anyone let me know how I can do it?
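For reference, this is the kind of RDD-based approach I'm referring to (just a rough sketch; the comma-separated layout and the name/age columns are made up for illustration):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical layout: each line is "name,age"
val rdd = spark.sparkContext.textFile("hdfs://quickstart:8020/user/cloudera/partfile")

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

// Map each line to a Row matching the schema, then build the DataFrame
val rowRdd = rdd.map(_.split(",")).map(parts => Row(parts(0), parts(1).trim.toInt))
val df = spark.createDataFrame(rowRdd, schema)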

Answer

When you use textFile, each line of the file becomes a string row in your Dataset. To convert it to a DataFrame with a schema, you can use toDF:

val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")

import spark.implicits._
val df = partFile.toDF("string_column")

In this case, the DataFrame will have a schema with a single column of type StringType.
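You can verify this by printing the schema (string_column is the name passed to toDF above):

df.printSchema
// root
//  |-- string_column: string (nullable = true)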

If your file contains a more complex schema, you can either use the CSV reader (if the file is in a structured CSV format):

val partFile = spark.read.option("header", "true").option("delimiter", ";").csv("hdfs://quickstart:8020/user/cloudera/partfile")

Or you can process your Dataset using map, then use toDF to convert it to a DataFrame. For example, suppose you want one column to be the first character of the line (as an Int) and the other column to be the fourth character (also as an Int):

val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")

val processedDataset: Dataset[(Int, Int)] = partFile.map {
  // Char.toInt yields the character's numeric code; use .asDigit if you want the digit value instead
  line: String => (line(0).toInt, line(3).toInt)
}

import spark.implicits._
val df = processedDataset.toDF("value0", "value3")

Alternatively, you can define a case class, which will represent the final schema of your DataFrame:

case class MyRow(value0: Int, value3: Int)

val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")

val processedDataset: Dataset[MyRow] = partFile.map {
  line: String => MyRow(line(0).toInt, line(3).toInt)
}

import spark.implicits._
val df = processedDataset.toDF

In both cases above, calling df.printSchema would show:

root
 |-- value0: integer (nullable = true)
 |-- value3: integer (nullable = true)
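Note that the Dataset[MyRow] produced by the case-class version already carries this schema, so you don't strictly need the DataFrame to see it; a small sketch, reusing the names from above:

// The typed Dataset derives its schema from the case class, so this prints the same output
processedDataset.printSchema

// You can also go back from the DataFrame to a typed Dataset
val typedAgain: Dataset[MyRow] = df.as[MyRow]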
