How to create a DataFrame from a text file in Spark
Question
I have a text file on HDFS and I want to convert it to a DataFrame in Spark.
I am using the Spark Context to load the file and then try to generate individual columns from that file:
val myFile = sc.textFile("file.txt")
val myFile1 = myFile.map(x=>x.split(";"))
After doing this, I am trying the following operation:
myFile1.toDF()
I am getting an issue, since the elements in the myFile1 RDD are now of type Array.
How can I solve this issue?
Answer
Update - as of Spark 2.0, you can simply use the built-in csv data source:
val spark: SparkSession = // create the Spark Session
val df = spark.read.csv("file.txt")
You can also use various options to control the CSV parsing, e.g.:
val df = spark.read.option("header", "false").csv("file.txt")
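Since the file in the question is semicolon-delimited, a minimal sketch of how those options could be combined (assuming an existing `SparkSession` named `spark`, as above, and that `file.txt` has no header row):

```scala
// Read a semicolon-delimited file; inferSchema costs an extra pass over the data.
val df = spark.read
  .option("sep", ";")            // the question's file uses ";" as the delimiter
  .option("header", "false")     // no header row assumed
  .option("inferSchema", "true") // infer column types instead of defaulting to strings
  .csv("file.txt")
```

Without `inferSchema`, every column is read as a string, which is often fine for a first look at the data.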
For Spark version < 2.0:
The easiest way is to use spark-csv - include it in your dependencies and follow the README. It allows setting a custom delimiter (;), can read CSV headers (if you have them), and can infer the schema types (at the cost of an extra scan of the data).
Alternatively, if you know the schema, you can create a case class that represents it and map your RDD elements into instances of this class before transforming into a DataFrame, e.g.:
case class Record(id: Int, name: String)
val myFile1 = myFile.map(x=>x.split(";")).map {
case Array(id, name) => Record(id.toInt, name)
}
myFile1.toDF() // DataFrame will have columns "id" and "name"
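The per-line parsing that the map above performs can be sketched without Spark at all; a plain Scala collection behaves the same way (the sample lines below are made up for illustration):

```scala
// Each line is split on ";" and pattern-matched into a typed case class,
// which is exactly what toDF() needs to derive the column names and types.
case class Record(id: Int, name: String)

val lines = Seq("1;alice", "2;bob")
val records = lines.map(_.split(";")).map {
  case Array(id, name) => Record(id.toInt, name)
}
// records contains Record(1, "alice") and Record(2, "bob")
```

Note that the pattern match will throw a `MatchError` on any line that does not split into exactly two fields, so malformed rows need to be filtered or handled first.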