Spark is pushing down a filter even when the column is not in the dataframe
Question
I have a DataFrame with the columns:
field1, field1_name, field3, field5, field4, field2, field6
I am selecting it so that I only keep field1, field2, field3, field4. Note that there is no field5 after the select.
After that, I have a filter that uses field5, and I would expect it to throw an analysis error since the column is not there, but instead it is filtering the original DataFrame (before the select) because it is pushing down the filter, as shown here:
== Parsed Logical Plan ==
'Filter ('field5 = 22)
+- Project [field1#43, field2#48, field3#45, field4#47]
+- Relation[field1#43,field1_name#44,field3#45,field5#46,field4#47,field2#48,field6#49] csv
== Analyzed Logical Plan ==
field1: string, field2: string, field3: string, field4: string
Project [field1#43, field2#48, field3#45, field4#47]
+- Filter (field5#46 = 22)
+- Project [field1#43, field2#48, field3#45, field4#47, field5#46]
+- Relation[field1#43,field1_name#44,field3#45,field5#46,field4#47,field2#48,field6#49] csv
== Optimized Logical Plan ==
Project [field1#43, field2#48, field3#45, field4#47]
+- Filter (isnotnull(field5#46) && (field5#46 = 22))
+- Relation[field1#43,field1_name#44,field3#45,field5#46,field4#47,field2#48,field6#49] csv
== Physical Plan ==
*Project [field1#43, field2#48, field3#45, field4#47]
+- *Filter (isnotnull(field5#46) && (field5#46 = 22))
+- *FileScan csv [field1#43,field3#45,field5#46,field4#47,field2#48] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/..., PartitionFilters: [], PushedFilters: [IsNotNull(field5), EqualTo(field5,22)], ReadSchema: struct<field1:string,field3:string,field5:string,field4:stri...
As you can see, the physical plan has the filter before the project... Is this the expected behaviour? I would expect an analysis exception instead...
Reproducible example of the question:
// Note: outside spark-shell / a notebook you also need
// import spark.implicits._ (for toDF)
val df = Seq(
  ("", "", "")
).toDF("field1", "field2", "field3")
val selected = df.select("field1", "field2")
val shouldFail = selected.filter("field3 == 'dummy'") // I was expecting this filter to fail
shouldFail.show()
Output:
+------+------+
|field1|field2|
+------+------+
+------+------+
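The surprising result above (an empty table rather than an error) can be mimicked in a few lines of plain Python. This is a toy sketch, not Spark code; `ToyFrame` and all of its methods are invented for illustration. The point is that `select` and `filter` only record steps, and at execution time the filter runs against the full source rows, which still contain field3:

```python
class ToyFrame:
    def __init__(self, rows, columns, plan=None):
        self.rows = rows          # source data, with all original columns
        self.columns = columns    # columns visible after the recorded selects
        self.plan = plan or []    # recorded (lazy) transformations

    def select(self, *cols):
        # Nothing is computed here; we only record the projection.
        return ToyFrame(self.rows, list(cols), self.plan + [("project", cols)])

    def filter(self, col, value):
        # The predicate is recorded, not validated against self.columns.
        return ToyFrame(self.rows, self.columns, self.plan + [("filter", (col, value))])

    def show(self):
        # The "action": only now is the plan executed. Recorded filters run
        # against the full source rows first ("pushed down") ...
        rows = self.rows
        for op, arg in self.plan:
            if op == "filter":
                col, value = arg
                rows = [r for r in rows if r[col] == value]
        # ... and the projection is applied last.
        return [{c: r[c] for c in self.columns} for r in rows]

df = ToyFrame([{"field1": "", "field2": "", "field3": ""}],
              ["field1", "field2", "field3"])
selected = df.select("field1", "field2")
should_fail = selected.filter("field3", "dummy")  # no error here either
print(should_fail.show())  # empty result, no exception
```

Because `filter` never checks the predicate column against the projected schema, the missing-column error the question expects simply has no place to occur before execution.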
Answer
The documentation on the Dataset/Dataframe describes the reason for what you are observing quite well:
"数据集是懒惰的",即只有在调用操作时才会触发计算.在内部,数据集代表一个逻辑计划,描述生成数据所需的计算.当调用一个操作时,Spark 的查询优化器会优化逻辑计划并生成一个物理计划,以便并行和高效地执行分布式方式."
"Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. "
The important part is that computations are only triggered when an action is invoked. When you apply select and filter statements, each one is just added to a logical plan, which Spark only processes when an action is invoked. When processing this full logical plan, the Catalyst Optimizer looks at the whole plan, and one of its optimization rules is to push down filters, which is what you see in your example.
I think this is a great feature. Even though you are not interested in seeing this particular field in your final Dataframe, it understands that you are not interested in some of the original data.
That is the main benefit of the Spark SQL engine as opposed to RDDs: it understands what you are trying to do without being told how to do it.