Read an external JSON file into an RDD and extract specific values in Scala


Question

First, I am completely new to Scala and Spark, although I am somewhat familiar with PySpark. I am working with an external JSON file that is very large, and I am not allowed to convert it into a Dataset or DataFrame. I have to perform all operations on a plain RDD.

So I would like to know how to get the value of a specific key. I read my JSON file with sc.textFile("information.json"). In Python I would normally do something like this:

import json

x = sc.textFile("information.json").map(lambda x: json.loads(x)) \
    .map(lambda x: (x['name'], x['roll_no'])).collect()

Is there an equivalent of the above code in Scala (extracting the values of specific keys) that works on an RDD, without converting to a DataFrame or Dataset?

This is essentially the same question as "Equivalent pyspark's json.loads function for spark-shell", but I am hoping for a more concrete and beginner-friendly answer. Thank you.

JSON data: {"name":"ABC", "roll_no":"12", "Major":"CS"}

Recommended answer

Option 1: RDD API + json4s library

One way is to use the json4s library, which Spark already uses internally.

import org.json4s._
import org.json4s.jackson.JsonMethods._

// {"name":"ABC1", "roll_no":"12", "Major":"CS1"}
// {"name":"ABC2", "roll_no":"13", "Major":"CS2"}
// {"name":"ABC3", "roll_no":"14", "Major":"CS3"}
val file_location = "information.json"

val rdd = sc.textFile(file_location)

rdd.map{ row =>
  val json_row = parse(row)

  (compact(json_row \ "name"), compact(json_row \ "roll_no"))
}.collect().foreach{println _}

// Output
// ("ABC1","12")
// ("ABC2","13")
// ("ABC3","14")

First we parse each row into json_row, then we access the properties of the row with the \ operator, e.g. json_row \ "name". The final result is a sequence of (name, roll_no) tuples. Because compact renders each value as JSON text, the strings keep their double quotes in the output.
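If the plain string values are wanted rather than their JSON rendering (the output above still carries double quotes), one possible variation is to pattern match on json4s's JString instead of calling compact. The sketch below is only an illustration: the helper name extractNameAndRoll is hypothetical and not part of the original answer, and it assumes both fields are JSON strings, as in the sample data. Wrapping parse in Try is likewise an added assumption, so that a malformed line yields None instead of failing the job.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods._
import scala.util.Try

// Hypothetical helper (not from the original answer): returns the unquoted
// name and roll_no of one JSON line, or None if the line cannot be parsed
// or either field is missing or is not a JSON string.
def extractNameAndRoll(line: String): Option[(String, String)] =
  Try(parse(line)).toOption.flatMap { json =>
    (json \ "name", json \ "roll_no") match {
      case (JString(name), JString(rollNo)) => Some((name, rollNo))
      case _                                => None
    }
  }

// Applied to the sample record from the question.
val sample = """{"name":"ABC", "roll_no":"12", "Major":"CS"}"""
val result = extractNameAndRoll(sample)
```

On an RDD, the same helper could be applied with sc.textFile(file_location).flatMap(line => extractNameAndRoll(line)), which keeps everything in the RDD API and silently drops lines that do not match. Whether dropping bad lines is acceptable depends on the data; logging or counting them may be preferable.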

Option 2: DataFrame API + get_json_object()

A more straightforward approach, where using a DataFrame is acceptable, is the DataFrame API in combination with the get_json_object() function.

import org.apache.spark.sql.functions.get_json_object
import spark.implicits._ // provides the $"..." column syntax; already imported in spark-shell

val df = spark.read.text(file_location)

df.select(
  get_json_object($"value","$.name").as("name"),
  get_json_object($"value","$.roll_no").as("roll_no"))
.collect()
.foreach{println _}

// [ABC1,12]
// [ABC2,13]
// [ABC3,14]
