Apply filter condition on dataframe created from JSON
Question
I am working with a dataframe created from JSON, and I want to apply a filter condition to it.
val jsonStr = """{ "metadata": [{ "key": 84896, "value": 54 },{ "key": 1234, "value": 12 }]}"""
val rdd = sc.parallelize(Seq(jsonStr))
val df = sqlContext.read.json(rdd)
Schema of df:
root
|-- metadata: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: long (nullable = true)
| | |-- value: long (nullable = true)
Now I need to filter the dataframe, which I am trying to do as follows:
val df1=df.where("key == 84896")
This throws the error:
ERROR Executor - Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.sql.AnalysisException: cannot resolve '`key`' given input columns: [metadata]; line 1 pos 0;
'Filter ('key = 84896)
The reason I want to use the where clause is that I have an expression string I want to apply directly, e.g. ( (key == 999, value == 55) || (key == 1234, value == 12) )
Recommended answer
From what I understand of your question and comment, you are trying to apply the expression ( (key == 999, value == 55) || (key == 1234, value == 12) ) to filter the dataframe rows.
First of all, the expression needs to change, because in this form it is not a valid filter expression for a Spark dataframe: the commas have to become and, and || has to become or. You can convert it as follows:
val expression = """( (key == 999, value == 55) || (key == 1234, value == 12) )"""
val actualExpression = expression.replace(",", " and").replace("||", "or")
This should give you the new, valid expression:
( (key == 999 and value == 55) or (key == 1234 and value == 12) )
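If the same rewrite is needed for several expression strings, it could be wrapped in a small helper. This is only a sketch: the object name ExpressionRewrite and the function toSqlPredicate are hypothetical, and the plain string replacement assumes that commas and || never appear inside literal values in the input.

```scala
object ExpressionRewrite {
  // Hypothetical helper: turns the custom syntax into a Spark SQL predicate.
  // Assumption: "," always means "and" and "||" always means "or";
  // neither token occurs inside a quoted literal.
  def toSqlPredicate(expression: String): String =
    expression.replace(",", " and").replace("||", "or")

  def main(args: Array[String]): Unit = {
    val input = "( (key == 999, value == 55) || (key == 1234, value == 12) )"
    println(toSqlPredicate(input))
  }
}
```

If the expressions can contain string literals with commas, a real parser would be safer than string replacement.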
Now that you have a valid expression, your dataframe needs modification too, because key and value are not top-level columns: they are nested inside the metadata column, whose schema is an array of structs, so the expression cannot be resolved against it directly.
So you need the explode function to expand the array elements into separate rows, and then the .* notation to select all fields of the struct as separate columns.
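import org.apache.spark.sql.functions.explode
import sqlContext.implicits._ // enables the $"..." column syntax outside spark-shell
val df1 = df.withColumn("metadata", explode($"metadata"))
  .select($"metadata.*")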
This should give you the dataframe:
+-----+-----+
|key |value|
+-----+-----+
|84896|54 |
|1234 |12 |
+-----+-----+
Finally, use the valid expression in where on the dataframe generated above:
df1.where(s"${actualExpression}")
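Putting the steps together, a self-contained sketch might look like the following. It assumes Spark 2.2 or later running in local mode and uses SparkSession instead of the sc and sqlContext values from the question; the object name FilterMetadataSketch and the function filterMetadata are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.explode

object FilterMetadataSketch {
  // Explodes the `metadata` array into one row per element, flattens the
  // struct into `key` and `value` columns, then applies the predicate string.
  def filterMetadata(df: DataFrame, predicate: String): DataFrame =
    df.withColumn("metadata", explode(df("metadata")))
      .select("metadata.*")
      .where(predicate)

  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark 2.2+ installation is on the classpath.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-metadata")
      .getOrCreate()
    import spark.implicits._

    val jsonStr = """{ "metadata": [{ "key": 84896, "value": 54 },{ "key": 1234, "value": 12 }]}"""
    val df = spark.read.json(Seq(jsonStr).toDS())

    // Rewrite the custom expression into a valid Spark SQL predicate.
    val expression = """( (key == 999, value == 55) || (key == 1234, value == 12) )"""
    val actualExpression = expression.replace(",", " and").replace("||", "or")

    filterMetadata(df, actualExpression).show()
    spark.stop()
  }
}
```

With the sample JSON, only the element with key 1234 and value 12 satisfies the predicate, so that is the single row that should remain after filtering.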
I hope the answer is helpful.