Apply filter condition on dataframe created from JSON

Views: 128 | Published: 2020/9/4 4:59:00 | Tags: scala apache-spark apache-spark-sql

Question

I am working with a dataframe created from JSON, and I want to apply a filter condition to it.

val jsonStr = """{ "metadata": [{ "key": 84896, "value": 54 },{ "key": 1234, "value": 12 }]}"""
val rdd = sc.parallelize(Seq(jsonStr))
val df = sqlContext.read.json(rdd)

The schema of df:

root
 |-- metadata: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: long (nullable = true)
 |    |    |-- value: long (nullable = true)

Now I need to filter the dataframe, which I am trying to do as follows:

val df1=df.where("key == 84896")

This throws an error:

ERROR Executor - Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.sql.AnalysisException: cannot resolve '`key`' given input columns: [metadata]; line 1 pos 0;
'Filter ('key = 84896)

I want to use a where clause because I have an expression string that I would like to apply directly, e.g. ( (key == 999, value == 55) || (key == 1234, value == 12) )

Recommended answer

From your question and comment, I understand that you are trying to apply the expression ( (key == 999, value == 55) || (key == 1234, value == 12) ) to filter the dataframe rows.

First of all, the expression itself needs to change, because in this form it cannot be applied as an expression to a dataframe in Spark. You can rewrite it like this:

val expression = """( (key == 999, value == 55) || (key == 1234, value == 12) )"""
val actualExpression = expression.replace(",", " and").replace("||", "or")

which should give you a new, valid expression:

( (key == 999 and value == 55) or (key == 1234 and value == 12) )
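
The rewrite above is plain string substitution, so it can be checked without Spark. Below is a minimal sketch that wraps the two `replace` calls in a helper. The name `toSparkExpression` is hypothetical (it is not part of the original answer), and the sketch assumes that `,` and `||` never appear inside string literals in the input expression.

```scala
// Hypothetical helper (not from the original answer): rewrites the custom
// expression syntax into something Spark SQL can parse.
// Assumption: "," and "||" never occur inside string literals in the input.
def toSparkExpression(expression: String): String =
  expression
    .replace(",", " and") // "a, b"   becomes "a and b"
    .replace("||", "or")  // "a || b" becomes "a or b"

val rewritten = toSparkExpression(
  """( (key == 999, value == 55) || (key == 1234, value == 12) )""")
println(rewritten)
// prints: ( (key == 999 and value == 55) or (key == 1234 and value == 12) )
```

Because the substitution is purely textual, an expression that contains a comma for any other reason (for example inside a function call) would be rewritten incorrectly, so this approach only suits expressions of the simple shape shown here.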

Now that you have a valid expression, your dataframe needs to be modified too, because you cannot run such an expression against a column whose schema is an array of structs: key and value are nested inside metadata, not top-level columns.

So you need the explode function to turn each array element into its own row, and then the .* notation to select every field of the struct as a separate column.

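import org.apache.spark.sql.functions.explode
import sqlContext.implicits._ // enables the $"..." column syntax
val df1 = df.withColumn("metadata", explode($"metadata"))
  .select($"metadata.*")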

which should give you the following dataframe:

+-----+-----+
|key  |value|
+-----+-----+
|84896|54   |
|1234 |12   |
+-----+-----+

Finally, use where on the generated dataframe:

df1.where(actualExpression)
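
For reference, the steps of this answer can be put together in one script. This is a sketch under stated assumptions rather than the original author's exact code: it assumes Spark 2.2 or later, creates a local `SparkSession` instead of using `sc` and `sqlContext`, and reads the JSON from a `Dataset[String]`. The names `flat` and `result` are illustrative.

```scala
// Sketch for spark-shell or a Scala script with Spark 2.2+ on the classpath.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Assumption: a local session; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-filter-sketch")
  .getOrCreate()
import spark.implicits._

val jsonStr = """{ "metadata": [{ "key": 84896, "value": 54 },{ "key": 1234, "value": 12 }]}"""

// Read the JSON string as a Dataset[String] (supported from Spark 2.2 onward).
val df = spark.read.json(Seq(jsonStr).toDS())

// One row per array element, then promote the struct fields to top-level columns.
val flat = df.select(explode($"metadata").as("metadata")).select($"metadata.*")

// Rewrite the custom expression into Spark SQL syntax.
val expression = """( (key == 999, value == 55) || (key == 1234, value == 12) )"""
val actualExpression = expression.replace(",", " and").replace("||", "or")

// Apply the rewritten expression as a where clause.
val result = flat.where(actualExpression)
result.show(false)
```

With the sample JSON from the question, only the row with key 1234 and value 12 should survive the filter, since no element has key 999.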

I hope the answer is helpful.
