How does geohash index work in Lucene

Question

In Lucene spatial 4 I'm wondering how the geohash index works behind the scenes. I understand the concept of the geohash, which basically takes a (lat, lon) point and creates a single "string" hash.
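
To make sure I'm describing the same thing, here is a minimal sketch of the standard geohash encoding as I understand it (my own illustration, not Lucene's or spatial4j's code): longitude and latitude bits are interleaved and emitted five bits at a time as base-32 characters, so nearby points share a common prefix.

```java
// Minimal geohash encoder (the standard algorithm; illustrative only, not library code).
public class GeohashSketch {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    static String encode(double lat, double lon, int length) {
        double[] latRange = {-90.0, 90.0};
        double[] lonRange = {-180.0, 180.0};
        StringBuilder hash = new StringBuilder();
        boolean useLon = true;   // geohash alternates longitude and latitude bits, longitude first
        int bits = 0, ch = 0;
        while (hash.length() < length) {
            double[] range = useLon ? lonRange : latRange;
            double value = useLon ? lon : lat;
            double mid = (range[0] + range[1]) / 2;
            if (value >= mid) { ch = (ch << 1) | 1; range[0] = mid; }
            else              { ch = ch << 1;       range[1] = mid; }
            useLon = !useLon;
            if (++bits == 5) {                 // 5 bits per base-32 character
                hash.append(BASE32.charAt(ch));
                bits = 0;
                ch = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        // Two nearby Manhattan points: the shared prefix "dr5ru" reflects their proximity.
        System.out.println(encode(40.7484, -73.9857, 6));  // dr5ru6
        System.out.println(encode(40.7589, -73.9851, 6));  // dr5ru7
    }
}
```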

Is the index just a "string" index (r-tree or quad-tree) or something along those lines (such as just indexing a last name), or is there something special about it?

For prefix-type searches, do all of the n-grams of the hash get indexed? For example, if a geohash is drgt2abc, does it get indexed as d, dr, drg, drgt, etc.?

Is there a default number of n-grams that we might want indexed?

With this type of indexing, will search queries over 100 thousand records versus 100 million records have similar performance for spatial queries (such as box/polygon or distance), or can I expect a general/typical slow degradation of the index as lots of records are added?

Thanks

Answer

The best online explanation is my video: Lucene / Solr 4 Spatial deep dive

Is the index just a "string" index (r-tree or quad-tree) or something along those lines (such as just indexing a last name), or is there something special about it?

Lucene, fundamentally, has just one index, used for text, numbers, and now spatial. You could say it's a string index: it's a sorted list of bytes/strings. From a higher-level view, using spatial in this way belongs to the family of "Tries", a.k.a. "PrefixTrees", in computer science.
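
Here is a minimal sketch of what that looks like with the Lucene 4.x spatial module, assuming lucene-spatial and spatial4j are on the classpath (the field name "location" and the 11-level tree depth are arbitrary choices, and package names moved in later releases):

```java
import com.spatial4j.core.context.SpatialContext;
import com.spatial4j.core.shape.Point;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.spatial.prefix.RecursivePrefixTreeStrategy;
import org.apache.lucene.spatial.prefix.tree.GeohashPrefixTree;
import org.apache.lucene.spatial.prefix.tree.SpatialPrefixTree;

// Sketch: a point becomes ordinary terms (geohash cell prefixes) in Lucene's one index.
public class SpatialIndexingSketch {
    public static void main(String[] args) {
        SpatialContext ctx = SpatialContext.GEO;                  // geodetic lat/lon context
        SpatialPrefixTree grid = new GeohashPrefixTree(ctx, 11);  // 11 geohash chars: sub-meter cells
        RecursivePrefixTreeStrategy strategy = new RecursivePrefixTreeStrategy(grid, "location");

        Point point = ctx.makePoint(-73.9857, 40.7484);           // x = lon, y = lat
        Document doc = new Document();
        for (Field field : strategy.createIndexableFields(point)) {
            doc.add(field);  // tokens are the covering cells, coarse to fine: roughly d, dr, dr5, ...
        }
        // doc would then be added to an IndexWriter like any other document.
    }
}
```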

For prefix-type searches, do all of the n-grams of the hash get indexed? For example, if a geohash is drgt2abc, does it get indexed as d, dr, drg, drgt, etc.?

Yes.
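
Conceptually, each prefix of the cell token becomes its own indexed term (the exact token format Lucene uses internally may add markers; this just shows the idea):

```java
// Conceptual view of the terms indexed for the geohash "drgt2abc".
public class PrefixTermsSketch {
    public static void main(String[] args) {
        String geohash = "drgt2abc";
        for (int len = 1; len <= geohash.length(); len++) {
            System.out.println(geohash.substring(0, len));   // d, dr, drg, drgt, drgt2, ...
        }
    }
}
```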

Is there a default number of n-grams that we might want indexed?

You tell it, conveniently, in terms of the precision you require, and it will look up how long the hash needs to be. Or you can tell it by length.
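
That precision is typically given as a maximum distance error in degrees (for example, the maxDistErr setting in Solr's example schema), which the factory translates into a tree depth. A rough sketch of that lookup, not the library's exact code, assuming the error is specified in degrees:

```java
// Rough sketch of the precision-to-length lookup: pick the shortest geohash length
// whose cell is no larger than the allowed error, expressed in degrees.
public class PrecisionToLength {
    public static void main(String[] args) {
        double maxDistErrDegrees = 0.000009;   // about one meter at the equator
        for (int len = 1; len <= 24; len++) {
            int lonBits = (int) Math.ceil(len * 5 / 2.0);   // geohash alternates lon/lat bits, lon first
            int latBits = len * 5 - lonBits;
            double lonErr = 360.0 / Math.pow(2, lonBits);   // cell width in degrees
            double latErr = 180.0 / Math.pow(2, latBits);   // cell height in degrees
            if (lonErr <= maxDistErrDegrees && latErr <= maxDistErrDegrees) {
                System.out.println("need " + len + " geohash characters");
                break;
            }
        }
    }
}
```

With 0.000009 degrees (about a meter), this lands on 11 characters, which is also the tree depth used in the indexing sketch above.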

With this type of indexing, will search queries over 100 thousand records versus 100 million records have similar performance for spatial queries (such as box/polygon or distance), or can I expect a general/typical slow degradation of the index as lots of records are added?

Indeed, this type of indexing (and, more specifically, the clever recursive search-tree algorithm that uses it) means you'll get scalable search performance. 100M is a ton of documents for one filter to match, so it will of course be slower than a filter that matches only 100K docs, but it's definitely sub-linear. And by next year it'll be even faster, due to work happening this summer on a new PrefixTree encoding, plus a spatial benchmark in progress that will let me make further tuning optimizations I have planned.
