How to parse a JSON file with Spark


Question

I have a JSON file to parse. The JSON format looks like this:

{"cv_id":"001","cv_parse": { "educations": [{"major": "English", "degree": "Bachelor" },{"major": "English", "degree": "Master "}],"basic_info": { "birthyear": "1984", "location": {"state": "New York"}}}}

I need to extract every value in the file. How can I get "major" out of the array, and do I have to retrieve "province" with df.select("cv_parse.basic_info.location.province")?

This is the result I want:

cv_id   major   degree  birthyear   state
001   English   Bachelor  1984     New York
001   English   Master    1984     New York

Recommended answer

This might not be the best way to do it, but you can give it a shot.

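// Import the built-in SQL functions (explode) and the implicits that enable the $"..." column syntax
import org.apache.spark.sql.functions._
import sqlContext.implicits._

// Read the JSON file into a DataFrame; the schema is inferred from the data
val jsonDf = sqlContext.read.json("sample-data/sample.json")

// Print the inferred schema
jsonDf.printSchema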

Your schema is:

root
 |-- cv_id: string (nullable = true)
 |-- cv_parse: struct (nullable = true)
 |    |-- basic_info: struct (nullable = true)
 |    |    |-- birthyear: string (nullable = true)
 |    |    |-- location: struct (nullable = true)
 |    |    |    |-- state: string (nullable = true)
 |    |-- educations: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- degree: string (nullable = true)
 |    |    |    |-- major: string (nullable = true)
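
The schema above suggests how nested fields can be addressed: a dotted path walks through struct levels. The following is a minimal sketch of that, not part of the original answer. It assumes Spark 2.2 or later, a local SparkSession (instead of the sqlContext used in the answer), and the sample record inlined as a string; the names sample, df, state and majors are illustrative. Since the sample schema has state and no province, a path ending in province would not resolve against it.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("nested-fields").getOrCreate()
import spark.implicits._

// The sample record from the question, inlined so the sketch needs no file.
val sample = """{"cv_id":"001","cv_parse":{"educations":[{"major":"English","degree":"Bachelor"},{"major":"English","degree":"Master "}],"basic_info":{"birthyear":"1984","location":{"state":"New York"}}}}"""
val df = spark.read.json(Seq(sample).toDS)

// A dotted path walks through nested structs down to the leaf field.
val state = df.select($"cv_parse.basic_info.location.state")

// The same path syntax through an array of structs yields a single array column,
// not one row per element, which is why explode is needed for the desired output.
val majors = df.select($"cv_parse.educations.major")

state.show()
majors.printSchema()
```

Here state holds one row with the nested string value, while majors holds one row whose only column is an array of strings.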

Now you need to explode the educations array, so that each element becomes its own row:

val explodedResult = jsonDf.select($"cv_id", explode($"cv_parse.educations"),
  $"cv_parse.basic_info.birthyear", $"cv_parse.basic_info.location.state")

explodedResult.printSchema

Your schema is now:

root
 |-- cv_id: string (nullable = true)
 |-- col: struct (nullable = true)
 |    |-- degree: string (nullable = true)
 |    |-- major: string (nullable = true)
 |-- birthyear: string (nullable = true)
 |-- state: string (nullable = true)

Now you can select the columns:

explodedResult.select("cv_id", "birthyear", "state", "col.degree", "col.major").show

+-----+---------+--------+--------+-------+
|cv_id|birthyear|   state|  degree|  major|
+-----+---------+--------+--------+-------+
|  001|     1984|New York|Bachelor|English|
|  001|     1984|New York| Master |English|
+-----+---------+--------+--------+-------+
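
For reference, the steps above can be combined into one self-contained sketch that also returns the columns in the order the question asked for. This is a variation on the answer, not the answer's exact code: it assumes Spark 2.2 or later and a local SparkSession, inlines the sample record instead of reading sample-data/sample.json, aliases the exploded column as education (a name chosen here for illustration) instead of relying on the default name col, and trims degree because the sample value "Master " carries a trailing space.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, trim}

// Assumption: a local session; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("flatten-cv").getOrCreate()
import spark.implicits._

// The sample record from the question, inlined so the sketch needs no file.
val sample = """{"cv_id":"001","cv_parse":{"educations":[{"major":"English","degree":"Bachelor"},{"major":"English","degree":"Master "}],"basic_info":{"birthyear":"1984","location":{"state":"New York"}}}}"""
val cvDf = spark.read.json(Seq(sample).toDS)

val flat = cvDf
  .select(
    $"cv_id",
    // explode emits one row per array element; the alias replaces the default name "col".
    explode($"cv_parse.educations").as("education"),
    $"cv_parse.basic_info.birthyear",
    $"cv_parse.basic_info.location.state")
  .select(
    $"cv_id",
    $"education.major",
    trim($"education.degree").as("degree"), // strips the trailing space in "Master "
    $"birthyear",
    $"state")

flat.show()
```

When reading from a file instead, spark.read.json(path) expects one JSON object per line by default; a pretty-printed file that spans several lines needs the multiLine option, for example spark.read.option("multiLine", true).json(path).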
