Apache Spark Read JSON With Extra Columns
Problem description
I'm reading a Hive table which has two columns, id and jsonString. I can easily transform the jsonString into a Spark data structure by calling the spark.read.json function, but I have to add the id column as well.
val jsonStr1 = """{"fruits":[{"fruit":"banana"},{"fruid":"apple"},{"fruit":"pera"}],"bar":{"foo":"[\"daniel\",\"pedro\",\"thing\"]"},"daniel":"daniel data random","cars":["montana","bagulho"]}"""
val jsonStr2 = """{"fruits":[{"dt":"banana"},{"fruid":"apple"},{"fruit":"pera"}],"bar":{"foo":"[\"daniel\",\"pedro\",\"thing\"]"},"daniel":"daniel data random","cars":["montana","bagulho"]}"""
val jsonStr3 = """{"fruits":[{"a":"banana"},{"fruid":"apple"},{"fruit":"pera"}],"bar":{"foo":"[\"daniel\",\"pedro\",\"thing\"]"},"daniel":"daniel data random","cars":["montana","bagulho"]}"""
// Run in spark-shell; in a compiled application you would also need: import spark.implicits._
case class Foo(id: Integer, json: String)
val ds = Seq(Foo(1, jsonStr1), Foo(2, jsonStr2), Foo(3, jsonStr3)).toDS

// spark.read.json can consume a Dataset[String] directly (Spark 2.2+),
// so there is no need to go through .rdd.map(...)
val jsonDF = spark.read.json(ds.select($"json").as[String])
jsonDF.show()
+--------------------+------------------+------------------+--------------------+
|                 bar|              cars|            daniel|              fruits|
+--------------------+------------------+------------------+--------------------+
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[,,, banana], [,...|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[, banana,,], [,...|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[banana,,,], [,,...|
+--------------------+------------------+------------------+--------------------+
I would like to add the id column from the Hive table, like this:
+--------------------+------------------+------------------+--------------------+---+
|                 bar|              cars|            daniel|              fruits| id|
+--------------------+------------------+------------------+--------------------+---+
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[,,, banana], [,...|  1|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[, banana,,], [,...|  2|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[banana,,,], [,,...|  3|
+--------------------+------------------+------------------+--------------------+---+
I won't use regular expressions.
I created a UDF that takes these two fields as arguments, uses a proper JSON library to include the desired field (id), and returns a new JSON string. It works like a charm, but I was hoping the Spark API offered a better way to do it. I'm using Apache Spark 2.3.0.
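For reference, here is a minimal sketch of that UDF approach, assuming the org.json library is on the classpath (the post doesn't say which JSON library was actually used, so the names here are illustrative):

import org.apache.spark.sql.functions.udf

// Parse the JSON string, inject the Hive id column, and serialize it back.
val addId = udf { (id: Int, json: String) =>
  val obj = new org.json.JSONObject(json)
  obj.put("id", id)
  obj.toString
}

// Enrich each JSON string with its id, then let Spark infer the schema as before.
val enriched = ds.select(addId($"id", $"json").alias("json"))
val jsonWithIdDF = spark.read.json(enriched.as[String])
jsonWithIdDF.show()

The downside is that every row gets parsed twice: once inside the UDF and again by spark.read.json.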
Recommended answer
I already knew about the from_json function, but in my case it would be "impossible" to write the schema for each JSON by hand. I was hoping that Spark would have an "idiomatic" interface for this.
This is my final solution:
ds.select($"id", from_json($"json", jsonDF.schema).alias("_json_path")).select($"_json_path.*", $"id").show
ds.select($"id", from_json($"json", jsonDF.schema).alias("_json_path")).select($"_json_path.*", $"id").show
+--------------------+------------------+------------------+--------------------+---+
|                 bar|              cars|            daniel|              fruits| id|
+--------------------+------------------+------------------+--------------------+---+
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[,,, banana], [,...|  1|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[, banana,,], [,...|  2|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[banana,,,], [,,...|  3|
+--------------------+------------------+------------------+--------------------+---+
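A side note (mine, not part of the original answer): this solution still depends on the schema that spark.read.json inferred for jsonDF. If re-running that inference pass on every job is too costly, the inferred schema can be captured once as JSON and restored later, for example:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{DataType, StructType}

// One-off: save the inferred schema, e.g. to a file or a config store.
val schemaJson: String = jsonDF.schema.json

// Later runs: restore the schema and parse without the extra inference pass.
val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]
ds.select($"id", from_json($"json", schema).alias("_json_path"))
  .select($"_json_path.*", $"id")
  .show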