How does Solr's cursorMark solve deep pagination while being stateless?


Question

    This question has been asked before but I'm not satisfied and I need further elaboration.

    These are the basic premises:

    1. cursorMark is stateless. It doesn't store anything on the server
    2. cursorMark is a computed value that tells whether to skip a document or not

    I'm not sure if I understood it properly but that's how I read the given explanations.

    My questions are:

    1. If cursorMark is meant to know which documents to skip then how is this not a search? It basically is a search by running through a sequence of documents and asking the question "is this what I'm looking for or do I need to skip this?"

    2. Still related to the first question, how is this "sequence of documents" calculated? Isn't that stored in the memory? Or is cursorMark creating a temporary file stored on the disk?

    3. How does it calculate the next cursorMark if it doesn't remember about the entire universe of results?

    All I can see is that there's no escape.

    Either you store some state about your search results or perform search for each page requests.

    Related references:

    cursorMark is stateless and how it solves deep paging

    How much time does the cursorMark is available on solr server

    http://yonik.com/solr/paging-and-deep-paging/

    Solution

    cursorMark doesn't affect the search itself: the search is still performed as it always is. cursorMark is not an index and plays no part in how documents are matched; it is a strategy for paginating efficiently through large result sets. That also makes your second question moot, since nothing changes about how the actual search is performed.

    The reason why cursorMark solves deep pagination becomes apparent when you consider the case for a cluster of Solr servers, such as when running in SolrCloud mode.

    Let's say you have four servers, A, B, C, and D, and want to retrieve 10 documents starting from row number 400 (we'll assume that one server == one shard of a larger collection to make this easier).

    In the regular case, you have to start by retrieving the following and then merging them. Each node returns its documents in sorted order, since it has to sort its result set locally according to your query anyway, just as for any regular query:

    • 410 documents from server A
    • 410 documents from server B
    • 410 documents from server C
    • 410 documents from server D

    You now have to go through 1640 documents to find out what your actual result set will be, as it could be that the 10 documents you're looking for all live on server C. Or maybe 350 are on server B and the rest on server D. It's impossible to say without actually retrieving 410 documents from each server. The result sets are merged in sorted order until 400 documents have been skipped and 10 documents have been found.

    Now say you want 10 documents starting from row 1 000 000: you'll have to retrieve 1 000 010 documents from each server, and merge and sort through a result set of 4 000 040 documents. You can see this becoming more and more expensive as the number of servers and documents increases, even though each new page only moves the starting point forward by 10 documents.
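
The arithmetic above can be written down as a small sketch. The function name `docs_to_merge` is made up for illustration; it only counts documents and does not talk to Solr.

```python
def docs_to_merge(start: int, rows: int, shards: int) -> int:
    """Documents the merging node has to collect for offset-based paging."""
    # Every shard must return its top (start + rows) documents, because any
    # of them could end up inside the requested window after the merge.
    return shards * (start + rows)


print(docs_to_merge(400, 10, 4))        # 1640
print(docs_to_merge(1_000_000, 10, 4))  # 4000040
```

The cost grows with the offset, not with the page size, which is the deep paging problem in one line.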

    Instead, let's assume that you know where the last document returned sits in the global sort order (that is, its sort value). The first query, without a cursorMark, will be the same as for regular pagination: get the first 10 documents from each server. Since we're starting at the beginning of the result set, and not from position 400 as in the first example, we only need 10 from each server.

    We process these 40 documents (a very manageable size), sort them and take the first 10, and then we include the global sort key (the cursorMark) of the last of those documents in the response. The client then includes this "global sort key" in its next request, which allows us to say "OK, we're not interested in any entries that would sort in front of this document, as we've already shown those". The next query would then retrieve:

    • 10 documents from server A, that would sort after cursorMark
    • 10 documents from server B, that would sort after cursorMark
    • 10 documents from server C, that would sort after cursorMark
    • 10 documents from server D, that would sort after cursorMark

    Now we're still just retrieving 10 documents from each server, even if our cursorMark is a million rows deep into the pagination. Previously we had to retrieve and handle 4 000 040 documents: even assuming each server returns its documents already sorted, we still had to walk through the result sets to skip the first million entries before picking the next ten. Now we only have to retrieve 40 documents and sort them locally to get the actual 10 documents to return.
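
From the client's side, this exchange could look roughly like the sketch below. `cursorMark`, `nextCursorMark`, `sort` and `rows` are the actual Solr parameter and response names; everything else is an assumption for illustration: the uniqueKey field being called `id`, the match-all query, the helper names, and the `fetch` callable, which stands in for whatever HTTP client you use and must return the parsed JSON response.

```python
from urllib.parse import urlencode


def cursor_query_url(base_url, cursor_mark="*", rows=10):
    """Build a select URL for one page; "*" asks for the first page."""
    params = {
        "q": "*:*",
        "sort": "id asc",          # assumption: the uniqueKey field is "id"
        "rows": rows,
        "cursorMark": cursor_mark,
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


def iterate_pages(fetch, base_url, rows=10):
    """Yield one list of documents per page until the cursor stops moving."""
    cursor = "*"
    while True:
        response = fetch(cursor_query_url(base_url, cursor, rows))
        yield response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # same mark returned: nothing left to read
            break
        cursor = next_cursor
```

The only thing carried from one request to the next is the cursor string held by the client, which is why nothing has to be stored on the server between pages.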

    To further explain how a cursorMark could work, let's assume that this approach only works on a unique column with integer values. That makes it easier to show what the cursorMark represents internally, and why the uniqueKey has to be present in the sort: if it weren't, we could end up with documents missing whenever the cursorMark landed on a document that shares its sort value with other documents. Each column below is one server's documents in sorted order:

    A    B    C    D
    1    2    3    4
    8    7    6    5
    9    10   11   12
    13   14   15   16
    17   20   21   22
    18   23   25   27
    19   24   26   28
    

    We make a request for 4 values (rows=4), starting from cursorMark 7. Each server can then look at its result set (which is sorted internally, as all result sets are) and retrieve 4 values starting from the first value that comes after 7 in the sorted order. In the table, < marks the first value after the cursorMark (which is also returned), and + marks the remaining documents included in the result set returned from that node:

    A    B    C    D
    1    2    3    4
    8  < 7    6    5
    9  + 10 < 11 < 12 <
    13 + 14 + 15 + 16 +
    17 + 20 + 21 + 22 +
    18   23 + 25 + 27 +
    19   24   26   28
    

    We then iterate through the returned result sets in sorted order until we've picked four documents from the top:

    8 (from A)
    9 (from A)
    10 (from B)
    11 (from C)
    

    And we include the cursorMark of the last document: 11. The next request is then made with 11 as the cursorMark, meaning that each server now returns 4 documents starting from the entry after 11 instead:

    A    B    C    D
    1    2    3    4
    8    7    6    5
    9    10   11   12 <
    13 < 14 < 15 < 16 +
    17 + 20 + 21 + 22 +
    18 + 23 + 25 + 27 +
    19 + 24 + 26 + 28
    

    And then we perform the merge again, picking the first 4 entries in sorted order (12, 13, 14 and 15), and include the next cursorMark (15).
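
The walkthrough above can be reproduced with a short simulation. This is only a sketch of the idea using the example's numbers, not how Solr is implemented internally; `SHARDS`, `shard_page` and `next_page` are names invented here, and each shard is modelled as a plain sorted list.

```python
import heapq
from bisect import bisect_right

# The four servers from the example, each holding its values in sorted order.
SHARDS = {
    "A": [1, 8, 9, 13, 17, 18, 19],
    "B": [2, 7, 10, 14, 20, 23, 24],
    "C": [3, 6, 11, 15, 21, 25, 26],
    "D": [4, 5, 12, 16, 22, 27, 28],
}


def shard_page(sorted_values, cursor, rows):
    """Return up to `rows` values that sort strictly after `cursor`."""
    start = 0 if cursor is None else bisect_right(sorted_values, cursor)
    return sorted_values[start:start + rows]


def next_page(shards, cursor, rows):
    """Merge the per-shard candidates and keep the first `rows` of them."""
    candidates = [shard_page(values, cursor, rows) for values in shards.values()]
    page = list(heapq.merge(*candidates))[:rows]
    next_cursor = page[-1] if page else cursor
    return page, next_cursor


page, cursor = next_page(SHARDS, 7, 4)
print(page, cursor)  # [8, 9, 10, 11] 11
page, cursor = next_page(SHARDS, cursor, 4)
print(page, cursor)  # [12, 13, 14, 15] 15
```

Each call receives everything it needs through its `cursor` argument; `next_page` keeps nothing between calls.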

    ... and that answers the third question: it doesn't need to know the global state, just which result comes next in the gigantic dataset.
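
As for why the uniqueKey has to be part of the sort, here is a minimal sketch under made-up data: the documents, the `price` field and the `paginate` helper are all hypothetical. It pages through the same documents twice, once with a cursor built from the sort field alone and once with the unique `id` added as a tiebreaker.

```python
def paginate(docs, sort_key, rows):
    """Page through docs with an 'everything after the cursor' rule."""
    ordered = sorted(docs, key=sort_key)
    seen, cursor = [], None
    while True:
        # Keep only documents that sort strictly after the last one returned.
        page = [d for d in ordered if cursor is None or sort_key(d) > cursor][:rows]
        if not page:
            return seen
        seen.extend(d["id"] for d in page)
        cursor = sort_key(page[-1])


docs = [
    {"id": "a", "price": 10},
    {"id": "b", "price": 20},
    {"id": "c", "price": 20},
    {"id": "d", "price": 20},
    {"id": "e", "price": 30},
]

# Cursor on price only: the page break falls inside the run of 20s,
# so "c" and "d" are skipped by the "after 20" rule.
print(paginate(docs, lambda d: d["price"], rows=2))             # ['a', 'b', 'e']

# Cursor on (price, id): every document has a distinct position.
print(paginate(docs, lambda d: (d["price"], d["id"]), rows=2))  # ['a', 'b', 'c', 'd', 'e']
```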
