analyzed vs not_analyzed: storage size
Problem description
I recently started using Elasticsearch 2. As I understand analyzed vs not_analyzed in the mapping, not_analyzed should take less storage (https://www.elastic.co/blog/elasticsearch-storage-the-true-story-2.0 and https://www.elastic.co/blog/elasticsearch-storage-the-true-story). For testing purposes I created some indexes with all the String fields analyzed (the default), and then some other indexes with all the fields not_analyzed. My surprise came when I checked the size of the indexes and saw that the ones with not_analyzed Strings were 40% bigger! I was inserting the same documents into each index (35,000 docs).
Any idea why this is happening? My documents are simple JSON documents. Each document has 60 String fields that I want to set as not_analyzed, and I tried both setting each field to not_analyzed individually and creating a dynamic template.
Edit: adding the mapping, although I don't think there is anything special about it:
{
  "mappings": {
    "my_type": {
      "_ttl": { "enabled": true, "default": "7d" },
      "properties": {
        "field1": {
          "properties": {
            "field2": {
              "type": "string", "index": "not_analyzed"
            },
            ... more not_analyzed String fields here ...
          }
        }
      }
    }
  }
}
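For comparison, the dynamic-template approach mentioned above could look like this in Elasticsearch 2.x (a sketch; the template name `strings_as_not_analyzed` is illustrative, not from the original post):

```json
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "strings_as_not_analyzed": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      ]
    }
  }
}
```

This maps every dynamically added string field as not_analyzed without listing all 60 fields explicitly.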
Answer
not_analyzed fields are still indexed. They just don't have any transformations applied to them beforehand ("analysis", in Lucene parlance).
As an example:
(Doc 1) "The quick brown fox jumped over the lazy dog"
(Doc 2) "Lazy like the fox"
- Simplified postings list created by the Standard Analyzer (the default for analyzed string fields - tokenized, lowercased, stopwords removed):
"brown": [1]
"dog": [1]
"fox": [1,2]
"jumped": [1]
"lazy": [1,2]
"over": [1]
"quick": [1]
30 characters of string data
- Simplified postings list created by "index": "not_analyzed":
"The quick brown fox jumped over the lazy dog": [1]
"Lazy like the fox": [2]
61 characters of string data
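The character counts above can be checked with a small sketch that mimics the two indexing modes. This is an illustration only: real Lucene postings are compressed binary structures, and the stopword list here is abridged to match the simplified example.

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Lazy like the fox",
}

# Abridged stopword list matching the simplified example above;
# the real Standard Analyzer ships a longer English list.
STOPWORDS = {"the", "like"}

def analyzed_terms(text):
    """Tokenize on whitespace, lowercase, drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

# analyzed: postings keyed by each unique normalized token
analyzed_postings = defaultdict(list)
for doc_id, text in docs.items():
    for term in sorted(set(analyzed_terms(text))):
        analyzed_postings[term].append(doc_id)

# not_analyzed: the entire field value is a single term
not_analyzed_postings = {text: [doc_id] for doc_id, text in docs.items()}

analyzed_chars = sum(len(t) for t in analyzed_postings)
not_analyzed_chars = sum(len(t) for t in not_analyzed_postings)
print(analyzed_chars)      # 30
print(not_analyzed_chars)  # 61
```

With only two short documents the difference is small, but the analyzed postings already reuse "fox" and "lazy" across both documents instead of storing them twice.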
Analysis causes input to get tokenized and normalized for the purpose of being able to look up documents using a term.
But as a result, the unit of text is reduced to a normalized term (versus the entire field value with not_analyzed), and all the redundant (normalized) terms across all documents are collapsed into a single logical list, saving you all the space that would normally be consumed by repeated terms and stopwords.
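This also suggests why the not_analyzed indexes in the question could come out larger: when field values are mostly distinct, every full value becomes its own dictionary term, while analyzed fields share a small vocabulary of tokens across documents. A rough sketch with made-up data (term-dictionary characters only; real index size also depends on norms, doc values, and Lucene's compression):

```python
# Hypothetical documents built from a small repeated vocabulary,
# so every full field value is distinct but tokens repeat heavily.
colors = ["red", "green", "blue", "amber", "violet"]
animals = ["fox", "dog", "owl", "hare", "lynx"]
places = ["river", "forest", "meadow", "ridge", "valley"]
docs = [f"{c} {a} near the {p}" for c in colors for a in animals for p in places]

# analyzed: unique tokens, shared across all documents
# (stopword removal skipped for brevity)
analyzed_terms = {tok for d in docs for tok in d.split()}

# not_analyzed: each distinct full field value is its own term
not_analyzed_terms = set(docs)

analyzed_chars = sum(len(t) for t in analyzed_terms)
not_analyzed_chars = sum(len(t) for t in not_analyzed_terms)
print(len(docs), len(analyzed_terms), len(not_analyzed_terms))
print(analyzed_chars, not_analyzed_chars)
```

Here 125 distinct documents collapse into just 17 shared analyzed terms, while the not_analyzed term dictionary keeps all 125 full values. Multiply that by 60 fields and 35,000 documents and the difference can dominate.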