What differs between post-filter and global aggregation for faceted search?


Problem description

A common problem in search interfaces is that you want to return a selection of results, but might want to return information about all documents. (e.g. I want to see all red shirts, but want to know what other colors are available).

This is sometimes referred to as "faceted results" or "faceted navigation". The example from the Elasticsearch reference explains the why and how quite clearly, so I have used it as the basis for this question.

Summary / Question: It looks like I can use either a post-filter or a global aggregation for this. Both seem to provide exactly the same functionality in different ways. Are there advantages or disadvantages to either that I don't see? If so, which should I use?

I have included a complete example below, with some documents and one query for each method, based on the example in the reference guide.


Option 1: post-filter

See the example from the Elasticsearch reference.

What we can do is match more documents in our original query, so that we can aggregate over those results and only afterwards filter down to the results we actually show.

The example is quite clear in explaining it:

But perhaps you would also like to tell the user how many Gucci shirts are available in other colors. If you just add a terms aggregation on the color field, you will only get back the color red, because your query returns only red shirts by Gucci.

Instead, you want to include shirts of all colors during aggregation, then apply the colors filter only to the search results.

See the example code below for how this looks.

One issue with this is that we cannot use caching. The Elasticsearch guide (not yet available for 5.1) warns about this:

Performance consideration

Use a post_filter only if you need to differentially filter search results and aggregations. Sometimes people will use post_filter for regular searches.

Don’t do this! The nature of the post_filter means it runs after the query, so any performance benefit of filtering (such as caches) is lost completely.

The post_filter should be used only in combination with aggregations, and only when you need differential filtering.

There is, however, a different option:

Option 2: global aggregations

There is a way to do an aggregation that is not influenced by the search query. So instead of fetching a lot, aggregating on that, and then filtering, we fetch only our filtered results but run the aggregations over everything. Take a look at the reference.

We can get exactly the same results. I did not read any warnings about caching for this approach, but it seems that in the end we need to do about the same amount of work, so that may be the only omission.

It is a tiny bit more complicated because of the sub-aggregation we need (you can't have global and a filter on the same 'level').

The only complaint I have read about queries using this is that you might have to repeat yourself if you need to do it for several items. In the end we can generate most queries, so repeating oneself isn't much of an issue for my use case, and I do not really consider it a problem on par with "cannot use the cache".
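As a rough sketch of what "generating the query" could look like, the repeated global / filter / terms nesting can be produced by a small helper. This is only an illustration: the function names `term` and `global_facet` and the aggregation names `filtered` and `values` are made up here and do not come from the question or from any Elasticsearch client.

```python
import json


def term(field, value):
    """Build a single term filter clause."""
    return {"term": {field: value}}


def global_facet(field, filters):
    """Build a global aggregation that ignores the search query, narrows the
    documents with its own filter, and counts the values of `field`.
    The names "filtered" and "values" are arbitrary choices for this sketch."""
    return {
        "global": {},
        "aggs": {
            "filtered": {
                "filter": {"bool": {"filter": filters}},
                "aggs": {"values": {"terms": {"field": field}}},
            }
        },
    }


# The user's current selection: red shirts by gucci.
selected = [term("color", "red"), term("brand", "gucci")]

body = {
    "query": {"bool": {"filter": selected}},
    "aggregations": {
        # Models are counted over the same selection as the hits.
        "models": global_facet("model", selected),
        # Colors are counted over all gucci shirts, whatever their color.
        "colors": global_facet("color", [term("brand", "gucci")]),
    },
}

print(json.dumps(body, indent=2))
```

The printed body has the same structure as the global aggregation query in the example section, only with different aggregation names.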

Question

It seems the two features overlap at the very least, and possibly provide exactly the same functionality. This baffles me. Apart from that, I'd like to know whether one has an advantage over the other that I haven't seen, and whether there is a best practice here.

Example

This is largely taken from the post-filter reference page, but I added the global aggregation query.

Mapping and documents

PUT /shirts
{
    "mappings": {
        "item": {
            "properties": {
                "brand": { "type": "keyword"},
                "color": { "type": "keyword"},
                "model": { "type": "keyword"}
            }
        }
    }
}

PUT /shirts/item/1?refresh
{
    "brand": "gucci",
    "color": "red",
    "model": "slim"
}

PUT /shirts/item/2?refresh
{
    "brand": "gucci",
    "color": "blue",
    "model": "slim"
}


PUT /shirts/item/3?refresh
{
    "brand": "gucci",
    "color": "red",
    "model": "normal"
}


PUT /shirts/item/4?refresh
{
    "brand": "gucci",
    "color": "blue",
    "model": "wide"
}


PUT /shirts/item/5?refresh
{
    "brand": "nike",
    "color": "blue",
    "model": "wide"
}

PUT /shirts/item/6?refresh
{
    "brand": "nike",
    "color": "red",
    "model": "wide"
}

We now want all red gucci shirts (items 1 and 3), the models available for these two shirts (slim and normal), and the colors in which gucci shirts are available (red and blue).

First, a post-filter: get all shirts, aggregate the models for red gucci shirts and the colors for gucci shirts (all colors), and post-filter for red gucci shirts so that only those are shown as results. (This is a bit different from the reference example, as we try to get as close as possible to a clear application of post-filters.)
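To make the expected outcome concrete, here is a plain-Python check of those three answers against the six documents above. It does not talk to Elasticsearch; it only mimics the filtering and counting in memory, and the helper name `matches` is invented for this sketch.

```python
from collections import Counter

# The six documents indexed above, keyed by document id.
shirts = {
    1: {"brand": "gucci", "color": "red", "model": "slim"},
    2: {"brand": "gucci", "color": "blue", "model": "slim"},
    3: {"brand": "gucci", "color": "red", "model": "normal"},
    4: {"brand": "gucci", "color": "blue", "model": "wide"},
    5: {"brand": "nike", "color": "blue", "model": "wide"},
    6: {"brand": "nike", "color": "red", "model": "wide"},
}


def matches(doc, **terms):
    """True if every given field has exactly the given value (like term filters)."""
    return all(doc[field] == value for field, value in terms.items())


# Search results: red gucci shirts.
hits = sorted(i for i, d in shirts.items() if matches(d, brand="gucci", color="red"))

# Facet 1: models, counted over the same red gucci shirts.
models = Counter(d["model"] for d in shirts.values() if matches(d, brand="gucci", color="red"))

# Facet 2: colors, counted over all gucci shirts regardless of color.
colors = Counter(d["color"] for d in shirts.values() if matches(d, brand="gucci"))

print(hits)          # [1, 3]
print(dict(models))  # {'slim': 1, 'normal': 1}
print(dict(colors))  # {'red': 2, 'blue': 2}
```

Whichever of the two Elasticsearch approaches is used, the hits and the two facet counts should correspond to these values.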

GET /shirts/_search
{
  "aggs": {
    "colors_query": {
      "filter": {
        "term": {
          "brand": "gucci"
        }
      },
      "aggs": {
        "colors": {
          "terms": {
            "field": "color"
          }
        }
      }
    },
    "color_red": {
      "filter": {
        "bool": {
          "filter": [
            {
              "term": {
                "color": "red"
              }
            },
            {
              "term": {
                "brand": "gucci"
              }
            }
          ]
        }
      },
      "aggs": {
        "models": {
          "terms": {
            "field": "model"
          }
        }
      }
    }
  },
  "post_filter": {
    "bool": {
      "filter": [
        {
          "term": {
            "color": "red"
          }
        },
        {
          "term": {
            "brand": "gucci"
          }
        }
      ]
    }
  }
}

We could also get all red gucci shirts (our original query), and then do a global aggregation for the model (over all red gucci shirts) and for the color (over all gucci shirts).

GET /shirts/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "color": "red"   }},
        { "term": { "brand": "gucci" }}
      ]
    }
  },
  "aggregations": {
    "color_red": {
      "global": {},
      "aggs": {
        "sub_color_red": {
          "filter": {
            "bool": {
              "filter": [
                { "term": { "color": "red"   }},
                { "term": { "brand": "gucci" }}
              ]
            }
          },
          "aggs": {
            "keywords": {
              "terms": {
                "field": "model"
              }
            }
          }
        }
      }
    },
    "colors": {
      "global": {},
      "aggs": {
        "sub_colors": {
          "filter": {
            "bool": {
              "filter": [
                { "term": { "brand": "gucci" }}
              ]
            }
          },
          "aggs": {
            "keywords": {
              "terms": {
                "field": "color"
              }
            }
          }
        }
      }
    }
  }
}

Both return the same information; the second differs only in the extra level introduced by the sub-aggregations. The second query looks a bit more complex, but I don't think that is much of a problem. A real-world query is generated by code and is probably far more complex anyway. It should be a good query, and if that means a complicated one, so be it.
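The "extra level" only changes the path under which the buckets are found. The sketch below uses two hand-written, abbreviated response bodies (derived from the six documents, not captured from a cluster, with most response fields omitted) to show the same color counts being read from both shapes. The helper name `buckets` is made up for this illustration.

```python
def buckets(response, *path):
    """Follow `path` below "aggregations" and return {key: doc_count}."""
    node = response["aggregations"]
    for key in path:
        node = node[key]
    return {b["key"]: b["doc_count"] for b in node["buckets"]}


color_buckets = [
    {"key": "blue", "doc_count": 2},
    {"key": "red", "doc_count": 2},
]

# Abbreviated shape for the post-filter query: filter -> terms.
post_filter_response = {
    "aggregations": {
        "colors_query": {"doc_count": 4, "colors": {"buckets": color_buckets}}
    }
}

# Abbreviated shape for the global query: global -> filter -> terms.
global_response = {
    "aggregations": {
        "colors": {
            "doc_count": 6,
            "sub_colors": {"doc_count": 4, "keywords": {"buckets": color_buckets}},
        }
    }
}

from_post_filter = buckets(post_filter_response, "colors_query", "colors")
from_global = buckets(global_response, "colors", "sub_colors", "keywords")

print(from_post_filter)                 # {'blue': 2, 'red': 2}
print(from_post_filter == from_global)  # True
```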

Solution

The actual solution we used, while not a direct answer to the question, is basically "neither".

We got the initial hint from this Elastic blog post:

Occasionally, I see an over-complicated search where the goal is to do as much as possible in as few search requests as possible. These tend to have filters as late as possible, completely in contrary to the advise in Filter First. Do not be afraid to use multiple search requests to satisfy your information need. The multi-search API lets you send a batch of search requests.

Do not shoehorn everything into a single search request.

And that is basically what we are doing in the queries above: a big bunch of aggregations and some filtering.

Splitting them up and having them run in parallel proved to be much quicker. Have a look at the multi-search API.
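The answer does not show the requests it ended up with, so the following is only a guess at how the shirt example might be split for the multi-search API: one search for the hits plus the model facet, and one for the color facet, each filtering first. The multi-search body is newline-delimited JSON, with a header line followed by a body line for every search. The function name `msearch_body` is hypothetical.

```python
import json


def msearch_body(searches):
    """Serialize (header, body) pairs as newline-delimited JSON.
    Each search takes two lines, and the payload must end with a newline."""
    lines = []
    for header, body in searches:
        lines.append(json.dumps(header))
        lines.append(json.dumps(body))
    return "\n".join(lines) + "\n"


gucci = {"term": {"brand": "gucci"}}
red = {"term": {"color": "red"}}

searches = [
    # Search 1: the red gucci hits plus the model facet over those same hits.
    ({}, {
        "query": {"bool": {"filter": [red, gucci]}},
        "aggs": {"models": {"terms": {"field": "model"}}},
    }),
    # Search 2: the color facet over all gucci shirts; size 0 skips the hits.
    ({}, {
        "size": 0,
        "query": {"bool": {"filter": [gucci]}},
        "aggs": {"colors": {"terms": {"field": "color"}}},
    }),
]

payload = msearch_body(searches)
print(payload)
```

The payload would be sent to `/shirts/_msearch`; newer Elasticsearch versions expect the `Content-Type: application/x-ndjson` header for it. Neither search needs `post_filter` or `global` here, because each query already selects exactly the documents its aggregation should count.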
