Query product catalog RavenDB store for spec aggregate over arbitrary collection of products


Problem description

This relates to this question.

I have the following model:

class Product {
  public string Id { get; set; }
  public string[] Specs { get; set; }
  public int CategoryId { get; set; }
}

The "Specs" array stores product specification name-value pairs joined by a special character. For example, if a product is colored blue, the spec string would be "Color~Blue". Representing specs this way allows querying for products that have several spec values specified by a query. There are two principal queries that I would like to support:

  1. Get all products in a given category.
  2. Get all products in a given category that have a specified set of specs.

This works well with RavenDB. However, in addition to the products satisfying a given query, I would like to return a result set containing all spec name-value pairs for the set of products the query selects. The spec name-value pairs should be grouped by the name and value of the spec and include a count of the products that have that pair. For query #1 I created the following map-reduce index:

class CategorySpecGroups {
    public int CategoryId { get; set; }
    public string Spec { get; set; }
    public int Count { get; set; }
}


public class SpecGroups_ByCategoryId : AbstractIndexCreationTask<Product, CategorySpecGroups>
{
    public SpecGroups_ByCategoryId()
    {
        this.Map = products => from product in products
                               where product.Specs != null
                               from spec in product.Specs
                               select new
                               {
                                   CategoryId = product.CategoryId,
                                   Spec = spec,
                                   Count = 1
                               };

        this.Reduce = results => from result in results
                                 group result by new { result.CategoryId, result.Spec } into g
                                 select new
                                 {
                                     CategoryId = g.Key.CategoryId,
                                     Spec = g.Key.Spec,
                                     Count = g.Sum(x => x.Count)
                                 };
    }
}

I can then query this index and get all spec name-value pairs in a given category. The problem I am running into is getting the same result set for a query that filters by both a category and a set of spec name-value pairs. In SQL, this result set would be obtained with a GROUP BY over the set of products filtered by category and specs. In general, this type of query is expensive, but when filtering by both category and specs the product sets are normally small, though not small enough to fit into a single page: they may contain up to 1000 products. For reference, MongoDB supports a group method that can be used to produce the same result set. It performs the ad hoc grouping server-side, and the performance is acceptable.

How can I get this type of result set using RavenDB?

One possible solution is to fetch all the products for a query and perform the grouping in memory. Another option is to create a map-reduce index like the one above. The challenge there is deducing all possible spec selections that can be made for a given category, and this type of index might also explode in size.
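To make the first option concrete, here is a minimal sketch of in-memory grouping over products that have already been loaded. It assumes the `Product` model from the question and the "~" separator from the "Color~Blue" example; `SpecCount` and `InMemorySpecAggregator` are hypothetical names introduced only for illustration, and the sketch does not cover paging the products out of RavenDB.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Product
{
    public string Id { get; set; }
    public string[] Specs { get; set; }
    public int CategoryId { get; set; }
}

// Hypothetical result type: one row per spec name-value pair.
public class SpecCount
{
    public string Name { get; set; }
    public string Value { get; set; }
    public int Count { get; set; }
}

// Hypothetical helper, not part of RavenDB.
public static class InMemorySpecAggregator
{
    // Keeps products in the category that carry every required spec,
    // then counts how many of those products have each spec pair.
    public static List<SpecCount> Aggregate(
        IEnumerable<Product> products,
        int categoryId,
        ICollection<string> requiredSpecs)
    {
        return products
            .Where(p => p.CategoryId == categoryId && p.Specs != null)
            .Where(p => requiredSpecs.All(s => p.Specs.Contains(s)))
            // Distinct() so a repeated spec on one product is counted once.
            .SelectMany(p => p.Specs.Distinct())
            .GroupBy(spec => spec)
            .Select(g =>
            {
                // Split "Color~Blue" into name and value at the first '~'.
                var parts = g.Key.Split(new[] { '~' }, 2);
                return new SpecCount
                {
                    Name = parts[0],
                    Value = parts.Length > 1 ? parts[1] : "",
                    Count = g.Count()
                };
            })
            .OrderBy(c => c.Name, StringComparer.Ordinal)
            .ThenBy(c => c.Value, StringComparer.Ordinal)
            .ToList();
    }
}
```

Calling `InMemorySpecAggregator.Aggregate(products, 1, new[] { "Color~Blue" })` returns the spec pairs present among the blue products in category 1, each with its product count. The cost is that every matching product has to be transferred to the client first, which is the drawback this option carries for sets of up to 1000 products.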

For an example, take a look at this fastener category page. The user can filter their selection by selecting attributes. When an attribute is selected, it narrows the selection of products and displays the attributes within the new set of products. This type of interaction is typically called faceted search.

Edit

In the meantime, I will be attempting a solution using Solr, since it supports faceted search out of the box.

Edit 2

It appears that RavenDB also supports faceted search (which of course makes sense, since its indexes are stored by Lucene, just like Solr's). I will explore this and post updates.

Edit 3

The RavenDB faceted search functionality works as expected. I store a facet setup document for each category ID, which is used to calculate facets for a query within a given category. The issue I am having now is performance. For a collection of 500k products with 4500 distinct categories, resulting in 4500 facet setup documents, a query by category id takes about 16 seconds when also querying for facets and about 0.05 seconds when not querying for facets. The particular category tested contains about 6k products, 23 distinct facets and 2k distinct facet name-range combinations. After looking at the code in FacetedQueryRunner, it appears that a facets query results in a Lucene query for every facet name-value combination to get the counts, as well as a query for each facet name to get the terms. One problem with the implementation is that it retrieves all the distinct terms for a given facet name regardless of the query, even though in most cases the query would significantly reduce the number of terms for a facet and therefore the number of Lucene queries. One way to improve performance here would be to store a MapReduce-computed result set (as shown above) for each facet setup document, which could then be queried to get all the distinct terms when further filtering by facets. The overall performance, however, may still be too slow.

Recommended answer

I've implemented this feature using RavenDB faceted search; however, I made some changes to FacetedQueryRunner to support a heuristic optimization. The heuristic is that, in my case, facets are only displayed in leaf categories. This is a reasonable constraint, since navigation between root and internal categories can be driven by either search or listings of child categories.

Given this constraint, I store a FacetSetup document for each leaf category, with an Id such as "facets/category_123". When the facet setup document is being stored, I have access to the facet names as well as the facet values (or ranges) contained in the category. Therefore, I can store all available facet values in the Ranges collection of each Facet in the FacetSetup document, while the facet mode remains FacetMode.Default.

Here are the changes to FacetedQueryRunner. Specifically, the optimization checks whether a given facet stores ranges, in which case it returns those values to use for searching instead of getting all terms in the index associated with that facet. In most cases this significantly reduces the number of Lucene searches required, since the available facet values in a given category are a subset of the facet values in the entire index.
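The actual change to FacetedQueryRunner is not reproduced here, so the following is only a rough sketch of the decision it describes, written against simplified stand-in types. `FacetDefinition`, `FacetTermSelector` and the `loadAllIndexTerms` delegate are assumptions made for illustration and are not RavenDB's classes or API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified stand-in for a facet entry in a facet setup document
// (an assumption for illustration, not RavenDB's Facet class).
public class FacetDefinition
{
    public string Name { get; set; }
    public List<string> Ranges { get; set; } = new List<string>();
}

public static class FacetTermSelector
{
    // Returns the terms that need a count query for this facet.
    // loadAllIndexTerms stands in for the expensive step of reading
    // every distinct term for the field from the whole index.
    public static IReadOnlyList<string> TermsToCount(
        FacetDefinition facet,
        Func<string, IEnumerable<string>> loadAllIndexTerms)
    {
        // Values stored with the facet are the ones known to occur in
        // this category, so prefer them and skip the index-wide scan.
        if (facet.Ranges != null && facet.Ranges.Count > 0)
        {
            return facet.Ranges;
        }

        // No stored values: fall back to all terms in the index.
        return loadAllIndexTerms(facet.Name).ToList();
    }
}
```

Under this sketch, the number of count queries per facet is bounded by the number of values stored for the category rather than by the number of distinct terms in the whole index.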

The next optimization that can be made applies when the original query filters only by a category id: in that case the FacetSetup document can store the counts as well. One way to do this, albeit a hacky one, would be to append the count to each facet value in the Ranges collection, then add a boolean to the FacetSetup document to indicate that counts are appended. A facet query of this kind would then simply return the values in the FacetSetup document, with no need to query.
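One way the appended counts could be encoded is sketched below. The "|" delimiter and the `FacetValueWithCount` helper are assumptions chosen for illustration; the answer does not say which format it would use.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper for packing a count into a stored facet value.
public static class FacetValueWithCount
{
    // Assumed delimiter; it only needs to be unambiguous at the end of the string.
    private const char Separator = '|';

    // "Blue", 42 becomes "Blue|42".
    public static string Encode(string value, int count)
    {
        return value + Separator + count.ToString(CultureInfo.InvariantCulture);
    }

    // Splits at the last separator, so facet values that themselves
    // contain the separator still round-trip.
    public static KeyValuePair<string, int> Decode(string stored)
    {
        int index = stored.LastIndexOf(Separator);
        if (index < 0)
        {
            throw new FormatException("No count appended to: " + stored);
        }

        string value = stored.Substring(0, index);
        int count = int.Parse(stored.Substring(index + 1), CultureInfo.InvariantCulture);
        return new KeyValuePair<string, int>(value, count);
    }
}
```

When the boolean flag on the setup document is set, a reader would call `Decode` on each entry and return the pairs directly. When the query also filters by specs, the stored counts no longer apply and must be computed by querying.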

One consideration is keeping the FacetSetup documents up to date, although this would be required either way. Beyond this optimization, caching can be used, which I believe is the approach taken by Solr faceted search.

Furthermore, it would be nice if the FacetSetup documents were automatically synchronized with the product collection, since they are effectively the result of an aggregating MapReduce operation over the set of products, grouping first by category id, then by facet name and then by value.
