ArangoDB Faceted Search Performance


Question

We are evaluating ArangoDB's performance for facet calculations. A number of other products can do the same, either via a dedicated API or a query language:


  • MarkLogic Facets

  • ElasticSearch Aggregations

  • Solr Faceting, etc.

We understand that there is no special API in Arango to calculate facets explicitly. In practice it is not needed: thanks to the comprehensive AQL, a facet can easily be computed with a simple query, like:

 FOR a in Asset 
  COLLECT attr = a.attribute1 INTO g
 RETURN { value: attr, count: length(g) }

This query calculates a facet on attribute1 and yields frequencies in the form of:

[
  {
    "value": "test-attr1-1",
    "count": 2000000
  },
  {
    "value": "test-attr1-2",
    "count": 2000000
  },
  {
    "value": "test-attr1-3",
    "count": 3000000
  }
]

It says that, across my entire collection, attribute1 took three values (test-attr1-1, test-attr1-2 and test-attr1-3), with the related counts provided. In effect, we ran a DISTINCT query and aggregated the counts.
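To make the "DISTINCT plus counts" reading concrete, here is a minimal Python sketch of the same computation over an in-memory list. The sample documents and the `facet` helper are hypothetical and only illustrate the shape of the result; they say nothing about how ArangoDB executes the query.

```python
from collections import Counter

# Hypothetical sample documents standing in for the Asset collection.
assets = [
    {"name": "test-asset-1", "attribute1": "test-attr1-1"},
    {"name": "test-asset-2", "attribute1": "test-attr1-2"},
    {"name": "test-asset-3", "attribute1": "test-attr1-1"},
]


def facet(docs, field):
    """Return one value/count pair per distinct value of `field`."""
    counts = Counter(doc[field] for doc in docs)
    # Pairs come back in first-seen order; the AQL query may order them differently.
    return [{"value": value, "count": count} for value, count in counts.items()]


facets = facet(assets, "attribute1")
print(facets)
```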

It looks simple and clean.

However, the query above runs for 31 seconds on a test collection with only 8M documents! We have experimented with different index types and storage engines (with RocksDB and without), and have studied the explain plans, to no avail. The test documents we use are very concise, with only three short attributes.

We would appreciate any input at this point. Either we are doing something wrong, or ArangoDB simply is not designed to perform well in this particular area.

By the way, the ultimate goal is to run something like the following in under a second:

LET docs = (FOR a IN Asset
  FILTER a.name like 'test-asset-%'
  SORT a.name
 RETURN a)

LET attribute1 = (
 FOR a in docs
  COLLECT attr = a.attribute1 INTO g
 RETURN { value: attr, count: length(g[*])}
)

LET attribute2 = (
 FOR a in docs
  COLLECT attr = a.attribute2 INTO g
 RETURN { value: attr, count: length(g[*])}
)

LET attribute3 = (
 FOR a in docs
  COLLECT attr = a.attribute3 INTO g
 RETURN { value: attr, count: length(g[*])}
)

LET attribute4 = (
 FOR a in docs
  COLLECT attr = a.attribute4 INTO g
 RETURN { value: attr, count: length(g[*])}
)

RETURN {
  counts: (RETURN {
    total: LENGTH(docs),
    offset: 2,
    to: 4,
    facets: {
      attribute1: {
        from: 0,
        to: 5,
        total: LENGTH(attribute1)
      },
      attribute2: {
        from: 5,
        to: 10,
        total: LENGTH(attribute2)
      },
      attribute3: {
        from: 0,
        to: 1000,
        total: LENGTH(attribute3)
      },
      attribute4: {
        from: 0,
        to: 1000,
        total: LENGTH(attribute4)
      }
    }
  }),
  items: (FOR a IN docs LIMIT 2, 4 RETURN {id: a._id, name: a.name}),
  facets: {
    attribute1: (FOR a in attribute1 SORT a.count LIMIT 0, 5 return a),
    attribute2: (FOR a in attribute2 SORT a.value LIMIT 5, 10 return a),
    attribute3: (FOR a in attribute3 LIMIT 0, 1000 return a),
    attribute4: (FOR a in attribute4 SORT a.count, a.value LIMIT 0, 1000 return a)
  }
}

Thanks!

Recommended answer

It turns out that the main discussion took place on the ArangoDB Google Group, where the full thread can be found.

Here is a summary of the current solution:


  • Run a custom build of ArangoDB from a specific feature branch in which a number of performance improvements have been made (hopefully they will make it into a main release soon)

  • No indexes are required for facet calculations

  • MMFiles is the preferred storage engine

  • AQL should be written to use "COLLECT attr = a.attributeX WITH COUNT INTO length" instead of "count: length(g)"

  • AQL should be split into smaller pieces and run in parallel (we are using Java 8's Fork/Join to spread the facet AQLs and then join them into a final result):

      • One AQL to filter/sort and retrieve the main entity (if required; when sorting/filtering, add the corresponding skiplist index)

      • The rest are small AQLs, one for each facet's value/frequency pairs
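The last points of the summary can be sketched roughly as follows. This is not the Java Fork/Join code we actually run: Python's thread pool stands in for it here, and `run_facets`, `run_query` and `fake_run_query` are hypothetical names introduced only for illustration. `run_query` is assumed to wrap whatever ArangoDB driver is in use and to accept a query string plus bind variables; the attribute name is passed as the bind parameter `@field`.

```python
from concurrent.futures import ThreadPoolExecutor

# One small facet query, counting inside COLLECT instead of building groups.
FACET_QUERY = """
FOR a IN Asset
  COLLECT attr = a.@field WITH COUNT INTO length
  RETURN { value: attr, count: length }
"""


def run_facets(run_query, fields, max_workers=4):
    """Run one facet query per attribute in parallel and merge the results.

    `run_query` is a hypothetical callable (query, bind_vars) -> iterable of rows.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            field: pool.submit(run_query, FACET_QUERY, {"field": field})
            for field in fields
        }
        # Collect each facet's value/count pairs under its attribute name.
        return {field: list(future.result()) for field, future in futures.items()}


def fake_run_query(query, bind_vars):
    """Stand-in for a real driver call so the sketch runs without a database."""
    sample = {
        "attribute1": [{"value": "test-attr1-1", "count": 2}],
        "attribute2": [{"value": "test-attr2-1", "count": 5}],
    }
    return sample[bind_vars["field"]]


result = run_facets(fake_run_query, ["attribute1", "attribute2"])
print(result)
```

The query that filters, sorts and pages the main entities would be submitted alongside these as a separate task, so that no single AQL has to materialize all facets at once.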

In the end, we achieved a more than 10x performance gain compared to the original AQL provided above.
