Apply filter condition on dataframe created from JSON

Question

I am working with a dataframe created from JSON, and I want to apply a filter condition to it.

val jsonStr = """{ "metadata": [{ "key": 84896, "value": 54 },{ "key": 1234, "value": 12 }]}"""
val rdd = sc.parallelize(Seq(jsonStr))
val df = sqlContext.read.json(rdd)

Schema of df:

root
 |-- metadata: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: long (nullable = true)
 |    |    |-- value: long (nullable = true)

Now I need to filter the dataframe, which I am trying to do as follows:

val df1=df.where("key == 84896")

This throws an error:

ERROR Executor - Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.sql.AnalysisException: cannot resolve '`key`' given input columns: [metadata]; line 1 pos 0;
'Filter ('key = 84896)

The reason I want to use a where clause is that I have an expression string I want to apply directly, e.g. ( (key == 999, value == 55) || (key == 1234, value == 12) )

Recommended answer

From what I understand of your question and comment, you are trying to apply the expression ( (key == 999, value == 55) || (key == 1234, value == 12) ) to filter the dataframe rows.

First of all, the expression itself needs to change, because in its current form it cannot be applied to a dataframe in Spark: each comma has to become and, and || has to become or. You can rewrite it as follows:

val expression = """( (key == 999, value == 55) || (key == 1234, value == 12) )"""
val actualExpression = expression.replace(",", " and").replace("||", "or")

This should give you the new, valid expression:

( (key == 999 and value == 55) or (key == 1234 and value == 12) )
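The rewrite above can be wrapped in a small function so it is easy to reuse and check. This is only a sketch: toSparkSqlExpression is a hypothetical helper name, not part of Spark, and it assumes the input never contains a comma or || that is meant literally (for example inside a quoted string value), since the substitution is purely textual.

```scala
// Hypothetical helper (not a Spark API): rewrites the question's ad-hoc syntax
// into Spark SQL boolean syntax by plain text substitution.
def toSparkSqlExpression(raw: String): String =
  raw
    .replace(",", " and") // "a == 1, b == 2"  ->  "a == 1 and b == 2"
    .replace("||", "or")  // "(...) || (...)"  ->  "(...) or (...)"

val expression = """( (key == 999, value == 55) || (key == 1234, value == 12) )"""
val actualExpression = toSparkSqlExpression(expression)
// actualExpression: ( (key == 999 and value == 55) or (key == 1234 and value == 12) )
```

If the expressions can contain string literals or function calls with commas, a plain replace would corrupt them, and a real parser or a stricter input format would be the safer choice.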

Now that you have a valid expression, your dataframe needs to change too, because you cannot run such an expression against a column whose schema is an array of structs.

So you need the explode function to turn each array element into its own row, and then the .* notation to select every field of the struct as a separate column.

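import org.apache.spark.sql.functions.explode
import sqlContext.implicits._ // enables the $"..." column syntax outside spark-shell
val df1 = df.withColumn("metadata", explode($"metadata"))
  .select($"metadata.*")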

This should give you the following dataframe:

+-----+-----+
|key  |value|
+-----+-----+
|84896|54   |
|1234 |12   |
+-----+-----+

Finally, apply the filter to the generated dataframe:

df1.where(s"${actualExpression}")
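Putting the steps together, the whole flow might look like the sketch below. It makes a few assumptions that are not in the original: it uses a local SparkSession (Spark 2.2 or later) instead of sc and sqlContext, reads the JSON from a Dataset[String], and hard-codes the already rewritten expression. The names flat and matched are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Assumption: Spark 2.2+ with a local session; the original uses sc/sqlContext instead.
val spark = SparkSession.builder().master("local[*]").appName("filter-json-sketch").getOrCreate()
import spark.implicits._

val jsonStr = """{ "metadata": [{ "key": 84896, "value": 54 },{ "key": 1234, "value": 12 }]}"""
val df = spark.read.json(Seq(jsonStr).toDS()) // one JSON document per Dataset element

// One row per array element, then one column per struct field (key, value).
val flat = df.select(explode($"metadata").as("metadata")).select($"metadata.*")

val actualExpression = "( (key == 999 and value == 55) or (key == 1234 and value == 12) )"
val matched = flat.where(actualExpression)
  .collect()
  .map(row => (row.getAs[Long]("key"), row.getAs[Long]("value")))
  .toSeq
// With the sample data, only (1234, 12) satisfies the expression.
```

Note that the (84896, 54) row is dropped here because neither branch of the expression matches it; the filter works on the exploded rows, so each array element is tested independently.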

I hope the answer is helpful.
