How do I use a from_json() dataframe in Spark?
Question
I'm trying to create a dataset from a json-string within a dataframe in Databricks 3.5 (Spark 2.2.1). In the code block below 'jsonSchema' is a StructType with the correct layout for the json-string which is in the 'body' column of the dataframe.
val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema))
This returns a dataframe where the root object is
jsontostructs(CAST(body AS STRING)):struct
followed by the fields in the schema (looks correct). When I try another select on the newDF
val transform = newDF.select($"propertyNameInTheParsedJsonObject")
it throws an exception:
org.apache.spark.sql.AnalysisException: cannot resolve '`columnName`' given
input columns: [jsontostructs(CAST(body AS STRING))];;
I'm apparently missing something. I hoped from_json would return a dataframe I could manipulate further.
My ultimate objective is to cast the json-string within the oldDF body-column to a dataset.
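For reference, a minimal self-contained repro of the setup described above, runnable in spark-shell; the sample JSON and schema are hypothetical stand-ins for the real 'body' payload and 'jsonSchema':

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("repro").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for oldDF: a "body" column holding a JSON string.
val oldDF = Seq("""{"propertyNameInTheParsedJsonObject":"value"}""").toDF("body")

// Hypothetical jsonSchema matching the payload above.
val jsonSchema = StructType(Seq(
  StructField("propertyNameInTheParsedJsonObject", StringType)
))

// Without an alias, the parsed fields end up nested inside a single
// generated struct column -- hence the AnalysisException on a plain select.
val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema))
newDF.printSchema()
```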
Answer
from_json returns a struct (or array<struct<...>>) column. It means it is a nested object. If you've provided a meaningful name:
val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema) as "parsed")
and the schema describes a plain struct, you could use standard methods like
newDF.select($"parsed.propertyNameInTheParsedJsonObject")
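Putting it together, a sketch of the struct case (sample data and schema are hypothetical); note that a star expansion on the alias also promotes every parsed field to a top-level column at once:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("struct-case").getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for oldDF and jsonSchema.
val oldDF = Seq("""{"propertyNameInTheParsedJsonObject":"hello"}""").toDF("body")
val jsonSchema = StructType(Seq(
  StructField("propertyNameInTheParsedJsonObject", StringType)
))

val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema) as "parsed")

// Select a single nested field by its dotted path...
newDF.select($"parsed.propertyNameInTheParsedJsonObject").show()

// ...or flatten the whole struct into top-level columns.
newDF.select($"parsed.*").show()
```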
otherwise please follow the instructions for accessing arrays.
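For the array case, a common pattern is to parse with an ArrayType schema and use explode to turn each element into its own row; the payload and schema below are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("array-case").getOrCreate()
import spark.implicits._

// Hypothetical body holding a JSON *array* of objects.
val oldDF = Seq("""[{"name":"a"},{"name":"b"}]""").toDF("body")
val arraySchema = ArrayType(StructType(Seq(StructField("name", StringType))))

val parsed = oldDF.select(from_json($"body".cast("string"), arraySchema) as "parsed")

// explode yields one row per array element, each a struct column...
val exploded = parsed.select(explode($"parsed") as "item")

// ...whose fields can then be selected as usual.
exploded.select($"item.name").show()
```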