How to parse a csv string into a Spark dataframe using scala?
Problem description
I would like to convert an RDD containing records of strings, like below, to a Spark dataframe.
"Mike,2222-003330,NY,34"
"Kate,3333-544444,LA,32"
"Abby,4444-234324,MA,56"
....
The schema line is not inside the same RDD, but in another variable:
val header = "name,account,state,age"
So now my question is: how do I use the above two to create a dataframe in Spark? I am using Spark version 2.2.
I did search and saw a post: Can I read a CSV represented as a string into Apache Spark using spark-csv. However, it's not exactly what I need and I can't figure out a way to modify this piece of code to work in my case.
Any help is highly appreciated.
Recommended answer
The easier way would probably be to start from the CSV file and read it directly as a dataframe (by specifying the schema). You can see an example here: Provide schema while reading csv file as a dataframe.
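For reference, reading the file directly with an explicit schema might look like the sketch below. The file path is hypothetical; the schema mirrors the header variable from the question, with age declared as an integer:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val spark: SparkSession = SparkSession.builder.getOrCreate()

// Schema matching "name,account,state,age"; age is read as Int directly.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("account", StringType, nullable = true),
  StructField("state", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

// "records.csv" is a placeholder path for illustration.
val df = spark.read.schema(schema).csv("records.csv")
```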
When the data already exists in an RDD, you can use toDF() to convert it to a dataframe. This function also accepts column names as input. To use this functionality, first import the Spark implicits using the SparkSession object:
val spark: SparkSession = SparkSession.builder.getOrCreate()
import spark.implicits._
Since the RDD contains strings, it first needs to be converted to tuples representing the columns in the dataframe. In this case, this will be an RDD[(String, String, String, Int)] since there are four columns (the last column, age, is changed to Int to illustrate how it can be done).
Assuming the input data is in rdd:
val header = "name,account,state,age"
val df = rdd.map(row => row.split(","))
.map{ case Array(name, account, state, age) => (name, account, state, age.toInt)}
.toDF(header.split(","):_*)
The resulting dataframe:
+----+-----------+-----+---+
|name| account|state|age|
+----+-----------+-----+---+
|Mike|2222-003330| NY| 34|
|Kate|3333-544444| LA| 32|
|Abby|4444-234324| MA| 56|
+----+-----------+-----+---+
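Since the question targets Spark 2.2, another option worth noting is that from that version onward DataFrameReader.csv also accepts a Dataset[String], so the RDD of CSV lines can be handed to the built-in CSV parser instead of splitting manually. A minimal sketch, assuming the same rdd and header variables as above:

```scala
import spark.implicits._

// Convert the RDD of CSV lines to a Dataset[String] and let the CSV
// reader parse it; inferSchema turns the age column into an integer.
val ds = rdd.toDS()
val df2 = spark.read
  .option("inferSchema", "true")
  .csv(ds)
  .toDF(header.split(","): _*)
```

This avoids the pattern-match on Array(...), at the cost of an extra pass over the data for schema inference.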