Element word positions - conceptual questions


Problem description

I'm trying to understand the impact of the element word positions index setting. The following XQuery returns the plan of a simple element-word-query search:

xdmp:plan(cts:search(doc(), 
  cts:and-query(
    cts:element-word-query(xs:QName("name"), "element word position")
  ),
  ("unfiltered")
))

The final plan when the index is not activated (abbreviated to save space):

<qry:and-query>
    <qry:term-query>element(name),pair(word("element"),word("word"))</qry:term-query>
    <qry:term-query>element(name),pair(word("word"),word("position"))</qry:term-query>
    <qry:term-query>word("element")</qry:term-query>
    <qry:term-query>word("word")</qry:term-query>
    <qry:term-query>word("position")</qry:term-query>
</qry:and-query>

Query plan after the index is activated (word-positions and also element word positions):

<qry:and-query>
    <qry:term-query>element(name),pair(word("element"),word("word"))</qry:term-query>
    <qry:term-query>element(name),pair(word("word"),word("position"))</qry:term-query>
    <qry:element-query>
        element(name)
        <qry:word-query>
            <qry:KP pos="0">word("element")</qry:KP>
            <qry:KP pos="1">word("word")</qry:KP>
            <qry:KP pos="2">word("position")</qry:KP>
        </qry:word-query>
    </qry:element-query>
</qry:and-query>

I assume that, because far fewer term queries are generated, the resulting candidate fragment id count will be smaller and the intersection at index resolution will therefore be faster. Beyond that, I'd really like to understand how an element-query works under the hood, so I have a few questions:

  • What kind of additional information is saved in the index if element word positions is activated?
  • What would the index and posting list look like? Is the key only the element, or an element+word combination? Are there any graphical resources that visualize it? (I'm not expecting you to draw something.)
  • How does an element-query execute? I see how a simple term-query returns the posting list of the term key, but I am not sure how an element-query with a word-query as a "sub-query" is evaluated.

Edit: Added a picture to visualize my understanding of how the index might look with element word positions enabled. (See the comments on mholstege's answer for details.)

Solution

When you turn on positions, we store a positions vector for each document in the index for the relevant term, instead of just the document id.
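
As a rough illustration of that difference, here is a toy model in Python. It is not MarkLogic's actual index format: the `build_index` helper, the word-only term keys, and the two sample documents are assumptions made up for this sketch. Without positions a term maps to document ids only; with positions it maps each document id to a vector of word offsets.

```python
from collections import defaultdict

# Hypothetical sample documents: document id -> text of the <name> element.
DOCS = {
    1: "element word position",
    2: "word position element word",
}

def build_index(docs, with_positions):
    """Build a toy inverted index keyed by word.

    Without positions: word -> sorted list of document ids.
    With positions: word -> {document id: [word offsets]}.
    """
    if with_positions:
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in docs.items():
            for pos, word in enumerate(text.split()):
                index[word][doc_id].append(pos)
        return {w: dict(p) for w, p in index.items()}
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return {w: sorted(ids) for w, ids in index.items()}

ids_only = build_index(DOCS, with_positions=False)
positional = build_index(DOCS, with_positions=True)
print(ids_only["word"])    # document ids only
print(positional["word"])  # document id -> positions vector
```

The positional variant is larger, which is the storage cost paid for being able to check word order at index resolution time.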

The way to think about this is in terms of the specificity of the leaf queries and the work involved in calculating them and intersecting intermediate results.

When you see a term-query in the query plan, that means it is just looking up document ids, so there is no knowledge of relative positioning -- a less accurate result for a long phrase like this, because the "element word" and "word position" could be occurring in two separate parent elements in the document. If your data only ever has one element with this name in each document, that could not happen, although you could still have false matches where the two-word subphrases occur in, say, the reverse order, or separated by other words.
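
The false-match case can be sketched with a small Python toy. It assumes a simplified model in which a pair term only records that two words are adjacent somewhere in the text; the `word_pairs` and `pair_terms_match` helpers and the sample strings are hypothetical, and element scoping is ignored.

```python
def word_pairs(text):
    """Return the set of adjacent word pairs in a text."""
    words = text.split()
    return set(zip(words, words[1:]))

def pair_terms_match(text, phrase):
    """Mimic resolving only pair term lookups: every adjacent pair of the
    phrase must occur somewhere in the text, with no positional check."""
    return word_pairs(phrase) <= word_pairs(text)

PHRASE = "element word position"
exact = "an element word position setting"
shuffled = "word position of an element word"  # both pairs, wrong order

print(pair_terms_match(exact, PHRASE))     # True
print(pair_terms_match(shuffled, PHRASE))  # True, a false match
print(PHRASE in shuffled)                  # False: the phrase is absent
```

Both texts satisfy the pair lookups, but only the first contains the phrase, which is why an unfiltered search on term queries alone can return extra documents.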

When you see word-query in the query plan, that means we are going to be looking at positions, and here you see the relative positions for each of the words in the phrase. When this is resolved, we examine the positions vector and toss out the ones that don't meet this positional constraint. So all the matches will have this sequence of words in this order: a more precise match.
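
A minimal Python sketch of that positional filtering follows. It assumes the toy layout word -> {document id: word offsets}; the `phrase_docs` function and the `INDEX` data are hypothetical, and the real engine's algorithm is not described in this answer.

```python
def phrase_docs(index, words):
    """Return ids of documents where `words` occur consecutively, in order.

    `index` maps word -> {document id: sorted list of word offsets}.
    """
    postings = [index.get(w, {}) for w in words]
    # Candidates must contain every word (document-id intersection).
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)
    matches = []
    for doc_id in sorted(candidates):
        later = [set(p[doc_id]) for p in postings[1:]]
        # A start offset s survives only if word k sits at offset s + k.
        if any(all(s + k in later[k - 1] for k in range(1, len(words)))
               for s in postings[0][doc_id]):
            matches.append(doc_id)
    return matches

# Hypothetical positional index for two documents:
#   doc 1: "element word position"
#   doc 2: "word position element word"
INDEX = {
    "element": {1: [0], 2: [2]},
    "word": {1: [1], 2: [0, 3]},
    "position": {1: [2], 2: [1]},
}
print(phrase_docs(INDEX, ["element", "word", "position"]))
```

Document 2 contains all three words and both two-word subphrases, yet it is discarded because no start offset lines up with the `pos="0"`, `pos="1"`, `pos="2"` pattern shown in the plan.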

The element-query in the plan is also applying positional constraints, of the element instances relative to the matches inside the element. There are optimizations where the element positional constraints are actually pushed down to the leaves of the query tree to avoid excess intermediate calculations.
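
One way to picture the element constraint is as a containment test between element instance spans and match positions. The Python sketch below assumes that each element instance is recorded as a (first offset, last offset) span of word offsets; that representation and the `matches_inside_element` helper are illustrative guesses, not the documented storage format.

```python
def matches_inside_element(element_spans, match_starts, phrase_len):
    """Keep phrase matches that lie entirely inside one element instance.

    element_spans: list of (first_offset, last_offset) per element instance.
    match_starts: word offsets where the phrase starts.
    """
    kept = []
    for start in match_starts:
        end = start + phrase_len - 1
        if any(first <= start and end <= last for first, last in element_spans):
            kept.append(start)
    return kept

# Hypothetical document with two <name> instances covering word offsets
# 0-2 and 3-5, and a three-word phrase found at offsets 0 and 2.
NAME_SPANS = [(0, 2), (3, 5)]
print(matches_inside_element(NAME_SPANS, [0, 2], 3))
```

The match starting at offset 2 straddles the boundary between the two instances, so it is dropped even though the words are consecutive in the document.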

You also see some technically redundant term queries: the point of these is to do simple term lookups that are probably more constrained than the leaf word queries. Since intersection of term lists from an and-query always proceeds from the shortest matching posting list, this can provide a fail-fast mechanism to avoid the more expensive positions calculations. There is a certain amount of heuristic judgement in that, and given a complex set of index options and query variations, sometimes those additional terms are, in fact, not helpful.
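
The fail-fast effect of starting from the shortest posting list can be sketched in Python as follows. The `intersect_shortest_first` function and the posting lists are hypothetical; the sketch only shows why a rare pair term can end the intersection before the longer lists, or any positions work, are touched.

```python
def intersect_shortest_first(posting_lists):
    """Intersect document-id lists, starting from the shortest one.

    Returns the surviving ids and how many lists were actually visited,
    so an early exit is visible.
    """
    ordered = sorted(posting_lists, key=len)
    if not ordered:
        return [], 0
    result = set(ordered[0])
    visited = 1
    for plist in ordered[1:]:
        if not result:
            break  # fail fast: nothing left to match
        result &= set(plist)
        visited += 1
    return sorted(result), visited

# Hypothetical postings: the pair term is rare, the single words are common.
pair_term = [2500]
word_element = list(range(1, 1000))
word_word = list(range(500, 2000))
print(intersect_shortest_first([word_element, pair_term, word_word]))
```

Here the single id from the pair term is eliminated by the second list, so the third and longest list is never visited.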
