Spark is pushing down a filter even when the column is not in the dataframe
Question
I have a DataFrame with the columns:
field1, field1_name, field3, field5, field4, field2, field6
I am selecting it so that I only keep field1, field2, field3, field4. Note that there is no field5 after the select.
After that, I have a filter that uses field5, and I would expect it to throw an analysis error since the column is not there, but instead it is filtering the original DataFrame (before the select) because it is pushing down the filter, as shown here:
== Parsed Logical Plan ==
'Filter ('field5 = 22)
+- Project [field1#43, field2#48, field3#45, field4#47]
+- Relation[field1#43,field1_name#44,field3#45,field5#46,field4#47,field2#48,field6#49] csv
== Analyzed Logical Plan ==
field1: string, field2: string, field3: string, field4: string
Project [field1#43, field2#48, field3#45, field4#47]
+- Filter (field5#46 = 22)
+- Project [field1#43, field2#48, field3#45, field4#47, field5#46]
+- Relation[field1#43,field1_name#44,field3#45,field5#46,field4#47,field2#48,field6#49] csv
== Optimized Logical Plan ==
Project [field1#43, field2#48, field3#45, field4#47]
+- Filter (isnotnull(field5#46) && (field5#46 = 22))
+- Relation[field1#43,field1_name#44,field3#45,field5#46,field4#47,field2#48,field6#49] csv
== Physical Plan ==
*Project [field1#43, field2#48, field3#45, field4#47]
+- *Filter (isnotnull(field5#46) && (field5#46 = 22))
+- *FileScan csv [field1#43,field3#45,field5#46,field4#47,field2#48] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/..., PartitionFilters: [], PushedFilters: [IsNotNull(field5), EqualTo(field5,22)], ReadSchema: struct<field1:string,field3:string,field5:string,field4:stri...
As you can see, the physical plan has the filter before the project... Is this the expected behaviour? I would expect an analysis exception instead...
Reproducible example of the question:
// Note: outside spark-shell / a notebook you also need
// import spark.implicits._ (for toDF)
val df = Seq(
  ("", "", "")
).toDF("field1", "field2", "field3")
val selected = df.select("field1", "field2")
val shouldFail = selected.filter("field3 == 'dummy'") // I was expecting this filter to fail
shouldFail.show()
Output:
+------+------+
|field1|field2|
+------+------+
+------+------+
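The surprising result above (an empty table rather than an error) can be mimicked in a few lines of plain Python. This is a toy sketch, not Spark code; `ToyFrame` and all of its methods are invented for illustration. The point is that `select` and `filter` only record steps, and at execution time the filter runs against the full source rows, which still contain field3:

```python
class ToyFrame:
    def __init__(self, rows, columns, plan=None):
        self.rows = rows          # source data, with all original columns
        self.columns = columns    # columns visible after the recorded selects
        self.plan = plan or []    # recorded (lazy) transformations

    def select(self, *cols):
        # Nothing is computed here; we only record the projection.
        return ToyFrame(self.rows, list(cols), self.plan + [("project", cols)])

    def filter(self, col, value):
        # The predicate is recorded, not validated against self.columns.
        return ToyFrame(self.rows, self.columns, self.plan + [("filter", (col, value))])

    def show(self):
        # The "action": only now is the plan executed. Recorded filters run
        # against the full source rows first ("pushed down") ...
        rows = self.rows
        for op, arg in self.plan:
            if op == "filter":
                col, value = arg
                rows = [r for r in rows if r[col] == value]
        # ... and the projection is applied last.
        return [{c: r[c] for c in self.columns} for r in rows]

df = ToyFrame([{"field1": "", "field2": "", "field3": ""}],
              ["field1", "field2", "field3"])
selected = df.select("field1", "field2")
should_fail = selected.filter("field3", "dummy")  # no error here either
print(should_fail.show())  # empty result, no exception
```

Because `filter` never checks the predicate column against the projected schema, the missing-column error the question expects simply has no place to occur before execution.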
Answer
The documentation on the Dataset/Dataframe describes the reason for what you are observing quite well:
"数据集是懒惰的",即只有在调用操作时才会触发计算.在内部,数据集代表一个逻辑计划,描述生成数据所需的计算.当调用一个操作时,Spark 的查询优化器会优化逻辑计划并生成一个物理计划,以便并行和高效地执行分布式方式."
"Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. "
The important part is that computations are only triggered when an action is invoked. When you apply select and filter statements, each one is just added to a logical plan, which Spark only processes when an action is invoked. When processing this full logical plan, the Catalyst Optimizer looks at the whole plan, and one of its optimization rules is to push down filters, which is what you see in your example.
I think this is a great feature. Even though you are not interested in seeing this particular field in your final Dataframe, it understands that you are not interested in some of the original data.
That is the main benefit of the Spark SQL engine as opposed to RDDs: it understands what you are trying to do without being told how to do it.