Spark DataFrames: convert nested JSON to separate columns
Question
I have a stream of JSON documents with the following structure, which gets converted to a DataFrame:
{
"a": 3936,
"b": 123,
"c": "34",
"attributes": {
"d": "146",
"e": "12",
"f": "23"
}
}
Calling show on the DataFrame produces the following output:
sqlContext.read.json(jsonRDD).show
+----+-----------+---+---+
| a| attributes| b| c|
+----+-----------+---+---+
|3936|[146,12,23]|123| 34|
+----+-----------+---+---+
How can I split the attributes column (a nested JSON structure) into attributes.d, attributes.e and attributes.f as separate columns in a new DataFrame, so that the new DataFrame has the columns a, b, c, attributes.d, attributes.e and attributes.f?
Recommended answer
- If you want the columns named a through f:
  df.select("a", "b", "c", "attributes.d", "attributes.e", "attributes.f")
- If you want the nested columns named with the attributes. prefix (the $ syntax requires import sqlContext.implicits._):
  df.select($"a", $"b", $"c", $"attributes.d" as "attributes.d", $"attributes.e" as "attributes.e", $"attributes.f" as "attributes.f")
- If the names of your columns are supplied from an external source (e.g. configuration):
  val colNames = Seq("a", "b", "c", "attributes.d", "attributes.e", "attributes.f")
  // select resolves the dotted paths; toDF then renames the result columns to the same strings
  df.select(colNames.head, colNames.tail: _*).toDF(colNames: _*)
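When the nested field names are not known ahead of time, the select list can also be derived from the DataFrame's schema. The following is a minimal sketch, not part of the original answer, assuming Spark 2.2 or later (for SparkSession and for reading JSON from a Dataset[String]). The helper name flattenOneLevel and the underscore-based naming (attributes_d rather than attributes.d) are hypothetical choices for illustration. It assumes field names contain no dots or other characters that would need backtick quoting.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

object FlattenExample {
  // Hypothetical helper: expands every top-level struct column one level deep.
  // Non-struct columns are passed through unchanged.
  def flattenOneLevel(df: DataFrame): DataFrame = {
    val cols: Array[Column] = df.schema.fields.flatMap { field =>
      field.dataType match {
        case struct: StructType =>
          // One output column per nested field, e.g. attributes.d -> attributes_d
          struct.fieldNames.map { nested =>
            col(s"${field.name}.$nested").alias(s"${field.name}_$nested")
          }
        case _ =>
          Array(col(field.name))
      }
    }
    df.select(cols: _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would reuse its existing session.
    val spark = SparkSession.builder().master("local[*]").appName("flatten").getOrCreate()
    import spark.implicits._

    // Same sample record as in the question
    val json = Seq(
      """{"a":3936,"b":123,"c":"34","attributes":{"d":"146","e":"12","f":"23"}}"""
    ).toDS()

    val df = spark.read.json(json)
    flattenOneLevel(df).show()
    spark.stop()
  }
}
```

The underscore is used in the aliases because a column whose name contains a literal dot has to be wrapped in backticks whenever it is referenced later, which is easy to forget. If the dotted names are required, replacing the underscore with a dot in the alias should give the same naming as the second option in the answer.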