Apply filter condition on dataframe created from JSON
Question
I am working with a dataframe created from JSON, and I want to apply a filter condition to it.
val jsonStr = """{ "metadata": [{ "key": 84896, "value": 54 },{ "key": 1234, "value": 12 }]}"""
val rdd = sc.parallelize(Seq(jsonStr))
val df = sqlContext.read.json(rdd)
Schema of df:
root
 |-- metadata: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: long (nullable = true)
 |    |    |-- value: long (nullable = true)
Now I need to filter the dataframe, which I am trying to do as follows:
val df1=df.where("key == 84896")
This throws an error:
ERROR Executor - Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.sql.AnalysisException: cannot resolve '`key`' given input columns: [metadata]; line 1 pos 0;
'Filter ('key = 84896)
I want to use the where clause because I have an expression string that I want to pass in directly,
e.g. ( (key == 999, value == 55) || (key == 1234, value == 12) )
Recommended answer
From your question and comment, I understand that you are trying to apply the expression ( (key == 999, value == 55) || (key == 1234, value == 12) ) to filter the dataframe rows.
First of all, the expression itself needs to change, because Spark cannot parse it as a dataframe filter expression in this form: the comma has to become and, and || has to become or. You can rewrite it as follows:
val expression = """( (key == 999, value == 55) || (key == 1234, value == 12) )"""
val actualExpression = expression.replace(",", " and").replace("||", "or")
This should give you a new, valid expression:
( (key == 999 and value == 55) or (key == 1234 and value == 12) )
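The rewrite can be wrapped in a small helper so that it is easy to check on its own. This is only a sketch: the object name ExpressionRewrite and the function toSqlExpression are hypothetical, and it assumes that "," always means and, that "||" always means or, and that neither token appears inside a value such as a quoted string.

```scala
object ExpressionRewrite {
  // Hypothetical helper: rewrites the custom syntax into Spark SQL syntax.
  // Assumption: "," always means AND and "||" always means OR, and neither
  // token occurs inside a value (for example within a quoted string literal).
  def toSqlExpression(expression: String): String =
    expression.replace(",", " and").replace("||", "or")

  def main(args: Array[String]): Unit = {
    val expression = "( (key == 999, value == 55) || (key == 1234, value == 12) )"
    // prints: ( (key == 999 and value == 55) or (key == 1234 and value == 12) )
    println(toSqlExpression(expression))
  }
}
```

A plain string replacement is enough for expressions of this shape; anything with commas inside literals would need a real parser instead.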
Now that you have a valid expression, your dataframe needs to be modified too, because you cannot run such an expression against a column whose schema is an array of structs: key and value are not top-level columns.
So you need the explode function to turn each element of the array into its own row, and then the .* notation to select all the fields of the struct as separate columns.
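import org.apache.spark.sql.functions.explode
import sqlContext.implicits._ // enables the $"..." column syntax outside the shell
val df1 = df.withColumn("metadata", explode($"metadata"))
  .select($"metadata.*")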
This should give you the dataframe:
+-----+-----+
|key  |value|
+-----+-----+
|84896|54   |
|1234 |12   |
+-----+-----+
Finally, apply the filter to the generated dataframe:
df1.where(s"${actualExpression}")
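Putting the steps together, the following is a minimal end-to-end sketch rather than a definitive implementation. It assumes Spark 2.2 or later and a local SparkSession in place of the sc and sqlContext used in the question; the names FilterNestedJson and filterMetadata are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.explode

object FilterNestedJson {
  // Hypothetical helper: flattens the "metadata" array of structs and
  // applies an already rewritten SQL expression to the resulting rows.
  def filterMetadata(spark: SparkSession, json: String, sqlExpression: String): DataFrame = {
    import spark.implicits._
    // read.json(Dataset[String]) is available from Spark 2.2 onwards.
    val df = spark.read.json(Seq(json).toDS())
    df.withColumn("metadata", explode($"metadata")) // one row per array element
      .select($"metadata.*")                        // one column per struct field
      .where(sqlExpression)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session stands in for the shell's sc/sqlContext.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-nested-json")
      .getOrCreate()

    val jsonStr = """{ "metadata": [{ "key": 84896, "value": 54 },{ "key": 1234, "value": 12 }]}"""
    val expression = """( (key == 999, value == 55) || (key == 1234, value == 12) )"""
    val actualExpression = expression.replace(",", " and").replace("||", "or")

    // Only the (1234, 12) element satisfies the rewritten expression.
    filterMetadata(spark, jsonStr, actualExpression).show(false)

    spark.stop()
  }
}
```

Note that exploding produces one row per array element, so the result no longer has one row per original JSON document.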
I hope the answer is helpful.