Elasticsearch index much larger than the actual size of the logs it indexed?

Problem description

I noticed that elasticsearch consumed over 30GB of disk space over night. By comparison the total size of all the logs I wanted to index is only 5 GB...Well, not even that really, probably more like 2.5-3GB. Is there any reason for this and is there a way to re-configure it? I'm running the ELK stack.

Solution

There are a number of reasons why the data inside of Elasticsearch would be much larger than the source data. Generally speaking, Logstash and Lucene are both working to add structure to data that is otherwise relatively unstructured. This carries some overhead.

If you're working with a source of 3 GB and your indexed data is 30 GB, that's a multiple of about 10x over your source data. That's big, but not necessarily unheard of. If you're including the size of replicas in that measurement, then 30 GB could be perfectly reasonable. Based on my own experience and intuition, I might expect something in the 3–5x range relative to source data, depending on the kind of data, and the storage and analysis settings you're using in Elasticsearch.

Here are four different settings you can experiment with when trying to slim down an Elasticsearch index.

The _source Field

Elasticsearch keeps a copy of the raw original JSON of each incoming document. It's useful if you ever want to reconstruct the original contents of your index, or for match highlighting in your search results, but it definitely adds up. You may want to create an index template which disables the _source field in your index mappings.

Disabling the _source field may be the single biggest improvement in disk usage.

Documentation: Elasticsearch _source field
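
To make that concrete, here is a minimal sketch of such an index template, written against the older Elasticsearch versions (1.x/2.x) this answer targets; the template name and the logstash-* index pattern are assumptions based on typical ELK defaults, not something taken from the original answer:

    # Hypothetical template name and index pattern; adjust to your setup.
    # Syntax targets Elasticsearch 1.x/2.x; newer versions use
    # "index_patterns" and typeless mappings instead of "template"
    # and "_default_".
    curl -XPUT -H 'Content-Type: application/json' \
      'http://localhost:9200/_template/logs_slim' -d '
    {
      "template": "logstash-*",
      "mappings": {
        "_default_": {
          "_source": { "enabled": false }
        }
      }
    }'

Keep in mind that without _source you can no longer reindex data back out of Elasticsearch or use the update API, and highlighting only works for fields you store individually.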

Individual stored fields

Similar to the _source field, but separate from it, you can control whether to store the values of a field on a per-field basis. Pretty straightforward, and mentioned a few times in the Mapping documentation for core types.

If you want a very small index, then you should only store the bare minimum fields that you need returned in your search responses. That could be as little as just the document ID to correlate with a primary data store.

Documentation: Elasticsearch mappings for core types
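
As a rough sketch in the same pre-5.x mapping syntax (the field names doc_id and message are made up for illustration): store defaults to false, and fields are normally returned from _source, so you only need store: true on the fields you still want back once _source is disabled.

    # Illustrative field names only; pre-5.x "string" syntax.
    # Only doc_id is stored, so it is the only field a search can
    # return once _source is disabled.
    curl -XPUT -H 'Content-Type: application/json' \
      'http://localhost:9200/_template/logs_slim' -d '
    {
      "template": "logstash-*",
      "mappings": {
        "_default_": {
          "_source": { "enabled": false },
          "properties": {
            "doc_id":  { "type": "string", "index": "not_analyzed", "store": true },
            "message": { "type": "string", "store": false }
          }
        }
      }
    }'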

The _all Field

Sometimes you want to find documents that match a given term, and you don't really care which field that term occurs in. For that case, Elasticsearch has a special _all field, into which it shoves all the terms in all the fields in your documents.

It's convenient, but if your searches are fairly well targeted to specific fields, and you're not trying to loosely match anything/everything anywhere in your index, then you can get away with not using the _all field.

Documentation: Elasticsearch _all field
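
Disabling it is a one-line addition to the mapping. A sketch in the same older syntax (the _all field only exists on the Elasticsearch versions this answer was written for; it was later deprecated and removed):

    # Pre-6.x only: _all no longer exists in recent Elasticsearch versions.
    curl -XPUT -H 'Content-Type: application/json' \
      'http://localhost:9200/_template/logs_slim' -d '
    {
      "template": "logstash-*",
      "mappings": {
        "_default_": {
          "_all": { "enabled": false }
        }
      }
    }'

After this, queries that relied on the implicit _all default (for example a bare query_string search) need to name their target fields explicitly.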

Analysis in general

This is back to the subject of Lucene adding structure to your otherwise unstructured data. Any fields which you intend to search against will need to be analyzed. This is the process of breaking a blob of unstructured text into tokens, and analyzing each token to normalize it or expand it into many forms. These tokens are inserted into a dictionary, and mappings between the terms and the documents (and fields) they appear in are also maintained.

This all takes space, and for some fields, you may not care to analyze them. Skipping analysis also saves some CPU time when indexing. Some kinds of analysis can really inflate your total terms, like using an n-gram analyzer with liberal settings, which breaks down your original terms into many smaller ones.

Documentation: Introduction to Analysis and Analyzers
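
A sketch of skipping analysis on a field, again in the pre-5.x string syntax (field names are illustrative; on 5.x and later the not_analyzed string corresponds to the keyword type and the analyzed string to the text type):

    # Illustrative fields: "status" is searchable only as an exact value
    # and skips analysis; "message" gets full-text analysis.
    curl -XPUT -H 'Content-Type: application/json' \
      'http://localhost:9200/_template/logs_slim' -d '
    {
      "template": "logstash-*",
      "mappings": {
        "_default_": {
          "properties": {
            "status":  { "type": "string", "index": "not_analyzed" },
            "message": { "type": "string", "index": "analyzed" }
          }
        }
      }
    }'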

