Understanding Segments in Elasticsearch


Question



I was under the assumption that each shard in Elasticsearch is an index. But I read somewhere that each segment is a Lucene index.

What exactly is a segment? How does it affect search performance? I have indices that reach around 450GB in size every day (I create a new one each day) with default Elasticsearch settings.

When I execute curl -XPOST "http://localhost:9200/logstash-2013.03.0$i_optimize?max_num_segments=1", I get num_committed_segments=11 and num_search_segments=11.

Shouldn't the above values be 1? Maybe it's because of index.merge.policy.segments_per_tier value? What is this tier anyway?

Solution

The word "index" gets abused a bit in Elasticsearch -- it applies to too many things.

To explain:

index

An "index" in Elasticsearch is a bit like a database in a relational DB. It's where you store/index your data. But actually, that's just what your application sees. Internally, an index is a logical namespace that points to one or more shards.
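The "logical namespace over shards" idea can be sketched like this. (A toy model: Elasticsearch actually hashes the routing key with murmur3, and the primary-shard count is fixed when the index is created. The md5 hash here is purely illustrative.)

```python
# Sketch: an "index" is a logical grouping of shards, and each document
# is routed to exactly one primary shard, roughly:
#   shard = hash(routing_key) % number_of_primary_shards
import hashlib

NUM_PRIMARY_SHARDS = 3  # fixed at index-creation time

def route_to_shard(doc_id: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Pick the shard that owns this document id (illustrative hash)."""
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % num_shards

# The "index" your application sees is just this grouping of shards:
index = {shard: [] for shard in range(NUM_PRIMARY_SHARDS)}
for doc_id in ["log-1", "log-2", "log-3", "log-4"]:
    index[route_to_shard(doc_id)].append(doc_id)

# Every document landed in exactly one shard
assert sum(len(docs) for docs in index.values()) == 4
```

Because the routing is deterministic, a later GET for the same document id lands on the same shard.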

Also, "to index" means to "put" your data into Elasticsearch. Your data is both stored (for retrieval) and "indexed" for search.

inverted index

An "inverted index" is the data structure that Lucene uses to make data searchable. It processes the data, pulls out unique terms or tokens, then records which documents contain those tokens. See http://en.wikipedia.org/wiki/Inverted_index for more.
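The idea can be shown in a few lines: map each unique term to the ids of the documents that contain it. (A minimal sketch with a whitespace tokenizer; Lucene's real structure is far more elaborate, with term dictionaries, postings lists, positions, and so on.)

```python
# Build a toy inverted index: term -> set of doc ids containing it.
from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick dog",
}

inverted = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():          # trivial whitespace tokenizer
        inverted[token].add(doc_id)

print(sorted(inverted["quick"]))   # -> [1, 3]
print(sorted(inverted["dog"]))     # -> [2, 3]
```

A term lookup is now a single dictionary access instead of a scan over every document, which is what makes search fast.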

shard

A "shard" is an instance of Lucene. It is a fully functional search engine in its own right. An "index" could consist of a single shard, but generally consists of several shards, to allow the index to grow and to be split over several machines.

A "primary shard" is the main home for a document. A "replica shard" is a copy of the primary shard that provides (1) failover in case the primary dies and (2) increased read throughput.
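Why replicas increase read throughput can be sketched as follows: a search against a shard can be served by the primary or by any replica copy, so spreading requests across the copies multiplies read capacity. (Round-robin here is illustrative only; Elasticsearch's actual request balancing is adaptive.)

```python
# Sketch: distribute searches across all copies of one shard.
from itertools import cycle

shard_copies = ["primary", "replica-0", "replica-1"]
pick = cycle(shard_copies)          # naive round-robin balancer

served_by = [next(pick) for _ in range(6)]
print(served_by)
# each of the three copies served 2 of the 6 requests
```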

segment

Each shard contains multiple "segments", where a segment is an inverted index. A search in a shard will search each segment in turn, then combine their results into the final results for that shard.
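The "search each segment in turn, then combine" step can be sketched with toy structures (each segment modeled as a small inverted index; real Lucene also merges scores, not just doc ids):

```python
# Each segment is its own inverted index; a shard-level search visits
# every segment and unions the per-segment hits.
segments = [
    {"error": {1, 4}, "timeout": {4}},     # older, larger segment
    {"error": {7}},                        # newer, small segment
]

def search_shard(term: str) -> set:
    hits = set()
    for segment in segments:               # each segment searched in turn
        hits |= segment.get(term, set())
    return hits                            # combined result for this shard

print(sorted(search_shard("error")))  # -> [1, 4, 7]
```

This is also why segment count matters for latency: every additional segment is one more index that each search must visit.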

While you are indexing documents, Elasticsearch collects them in memory (and in the transaction log, for safety) then every second or so, writes a new small segment to disk, and "refreshes" the search.

This makes the data in the new segment visible to search (i.e. the documents are "searchable"), but the segment has not been fsync'ed to disk, so it is still at risk of data loss.

Every so often, Elasticsearch will "flush", which means fsync'ing the segments (they are now "committed") and clearing out the transaction log, which is no longer needed because we know that the new data has been written to disk.
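The buffer/refresh/flush cycle described above can be modeled in a few lines. (A toy model: the names and structures are illustrative, not Elasticsearch internals.)

```python
# Toy model: documents sit in an in-memory buffer plus a transaction log;
# "refresh" turns the buffer into a searchable (not yet fsync'ed) segment;
# "flush" commits segments to disk and clears the translog.
buffer, translog = [], []
searchable_segments, committed_segments = [], []

def index_doc(doc):
    buffer.append(doc)
    translog.append(doc)        # durability before any segment is fsync'ed

def refresh():
    if buffer:
        searchable_segments.append(list(buffer))  # visible to search now
        buffer.clear()

def flush():
    refresh()
    committed_segments.extend(searchable_segments)  # "fsync" the segments
    translog.clear()            # safe: the data is on disk

index_doc("doc-1"); index_doc("doc-2")
refresh()   # both docs searchable, but translog still guards them
assert len(searchable_segments) == 1 and len(translog) == 2
flush()     # segments committed, translog no longer needed
assert committed_segments and not translog
```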

The more segments there are, the longer each search takes. So Elasticsearch will merge a number of segments of a similar size ("tier") into a single bigger segment, through a background merge process. Once the new bigger segment is written, the old segments are dropped. This process is repeated on the bigger segments when there are too many of the same size.
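The tiered-merge idea can be sketched as: once enough segments of a similar size accumulate (the `segments_per_tier` threshold from the question), merge them into one bigger segment, then repeat on the bigger tier. (Sizes are abstract units; Lucene's real TieredMergePolicy weighs many more factors.)

```python
# Sketch: merge any tier that has accumulated SEGMENTS_PER_TIER
# equal-sized segments into a single bigger segment, repeatedly.
SEGMENTS_PER_TIER = 3

def merge_tiers(sizes):
    changed = True
    while changed:
        changed = False
        for size in sorted(set(sizes)):
            if sizes.count(size) >= SEGMENTS_PER_TIER:
                for _ in range(SEGMENTS_PER_TIER):
                    sizes.remove(size)          # old segments are dropped
                sizes.append(size * SEGMENTS_PER_TIER)  # one bigger segment
                changed = True
                break
    return sorted(sizes)

print(merge_tiers([1, 1, 1, 1, 3]))  # -> [1, 3, 3]
```

Note the three size-1 segments merged into one size-3 segment, but the two size-3 segments then stay put: their tier has not yet reached the threshold. That is also why the questioner still sees 11 segments rather than 1 under the default merge policy.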

Segments are immutable. When a document is updated, it actually just marks the old document as deleted, and indexes a new document. The merge process also expunges these old deleted documents.
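The delete-mark-and-merge behavior can be sketched like this (hypothetical toy structures: an "update" tombstones the old doc in its immutable segment, re-indexes into the newest segment, and a later merge copies only the live documents):

```python
# Sketch: segments never change in place; updates are tombstone + re-index,
# and merging expunges the tombstoned documents.
old_segment = {"docs": {1: "v1", 2: "v1"}, "deleted": set()}
new_segment = {"docs": {}, "deleted": set()}

def update(doc_id, body):
    old_segment["deleted"].add(doc_id)   # mark old doc deleted; bytes untouched
    new_segment["docs"][doc_id] = body   # index the new version elsewhere

update(1, "v2")

def merge(*segments):
    merged = {"docs": {}, "deleted": set()}
    for seg in segments:                 # copy only live (non-deleted) docs
        for doc_id, body in seg["docs"].items():
            if doc_id not in seg["deleted"]:
                merged["docs"][doc_id] = body
    return merged

merged = merge(old_segment, new_segment)
print(merged["docs"])  # -> {2: 'v1', 1: 'v2'}
```

Until the merge runs, the deleted "v1" of document 1 still occupies disk space and is skipped at search time, which is why expunging deletes is a side effect of merging.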
