Why doesn't Spark Mongo connector push down filters?

Problem description

I have a large Mongo collection that I want to use in my Spark application via the Spark Mongo connector. The collection is quite large (>10 GB), holds daily data, and has an index on the original_item.CreatedDate field. Queries selecting a couple of days directly in Mongo are extremely fast (under a second). However, when I write the same query using DataFrames, the filter is not pushed down to Mongo, resulting in extremely slow performance, as Spark apparently fetches the entire collection and does the filtering itself.

The query looks like this:

collection \
      .filter("original_item.CreatedDate > %s" % str(start_date_timestamp_ms)) \
      .filter("original_item.CreatedDate < %s" % str(end_date_timestamp_ms)) \
      .select(...)
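
(How I checked: calling explain() on the filtered frame prints the physical plan, including the PushedFilters list quoted below. A minimal sketch; the select list from the query above is omitted.)

filtered = collection \
      .filter("original_item.CreatedDate > %s" % str(start_date_timestamp_ms)) \
      .filter("original_item.CreatedDate < %s" % str(end_date_timestamp_ms))

# explain() prints the physical plan, including the PushedFilters: [...] entry
filtered.explain()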

In the physical plan I see:

PushedFilters: [IsNotNull(original_item)]

When I make a similar query filtering on another field of that collection, Mongo successfully pushes it down: PushedFilters: [IsNotNull(original_item), IsNotNull(doc_type), EqualTo(doc_type,case)]!

Could it be that GreaterThan filter pushdown is not supported by the Mongo Spark connector, or that there is a bug in it?

Thanks!

Answer

It's not the GreaterThan that's causing your issue; it's the fact that the filter is on a nested field. Your filter on doc_type works because it's not nested. This is apparently an issue with the Catalyst engine in Spark, not with the Mongo connector, and it affects predicate pushdown for other sources such as Parquet as well.
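
As a workaround (not part of the answer above, just a common approach), you can bypass Catalyst for this predicate and hand Mongo an aggregation pipeline at load time, so the $match runs server-side against the index. A rough sketch, assuming a 2.x connector version that supports the pipeline read option; the 10.x connector spells it aggregation.pipeline and uses the mongodb format name, so check your version's docs:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The $match stage is executed by MongoDB itself, so the index on
# original_item.CreatedDate is used before any rows reach Spark.
pipeline = """[
  {"$match": {"original_item.CreatedDate": {"$gt": %d, "$lt": %d}}}
]""" % (start_date_timestamp_ms, end_date_timestamp_ms)

collection = spark.read \
      .format("mongo") \
      .option("pipeline", pipeline) \
      .load()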

See the following discussions in the Spark Jira for more details:

SPARK-19638

SPARK-17636
