analyzed vs not_analyzed: storage size


Question

I recently started using Elasticsearch 2. As I understand analyzed vs not_analyzed in the mapping, not_analyzed should be better for storage (https://www.elastic.co/blog/elasticsearch-storage-the-true-story-2.0 and https://www.elastic.co/blog/elasticsearch-storage-the-true-story). For testing purposes I created some indexes with all the string fields analyzed (the default), and then some other indexes with all the fields set to not_analyzed. To my surprise, when I checked the index sizes, the indexes with the not_analyzed strings were 40% bigger! I inserted the same documents into each index (35000 docs).

Any idea why this is happening? My documents are simple JSON documents. Each document has 60 string fields that I want to set as not_analyzed. I tried both setting each field to not_analyzed explicitly and creating a dynamic template.

Edit: adding the mapping, although I don't think there is anything special about it:

    {
        "mappings": {
            "my_type" : {
                          "_ttl" : { "enabled" : true, "default" : "7d" },
                          "properties" : {
                                "field1" : {
                                    "properties" : {
                                        "field2" : {
                                            "type" : "string", "index" : "not_analyzed"
                                        }
                                        more not_analyzed String fields here
                                  ...
                              ...
                          ...
}
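The question also mentions a dynamic template as the alternative to listing all 60 fields. As a rough sketch only, the request body for such a template in the Elasticsearch 2.x mapping format might be built like this in Python. The helper name `not_analyzed_strings_mapping` and the template name `strings_not_analyzed` are hypothetical and chosen for illustration; the asker's actual template is not shown in the post.

```python
import json


def not_analyzed_strings_mapping(type_name):
    """Build an index-creation body that maps every dynamically detected
    string field to not_analyzed (Elasticsearch 2.x mapping format)."""
    return {
        "mappings": {
            type_name: {
                "dynamic_templates": [
                    {
                        # Hypothetical template name, any label works here.
                        "strings_not_analyzed": {
                            "match_mapping_type": "string",
                            "mapping": {
                                "type": "string",
                                "index": "not_analyzed",
                            },
                        }
                    }
                ]
            }
        }
    }


# Print the JSON body that would be sent when creating the index.
print(json.dumps(not_analyzed_strings_mapping("my_type"), indent=2))
```

The printed JSON would be sent as the body of the index-creation request, in place of a mapping that names every field explicitly.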


Recommended answer

not_analyzed fields are still indexed. They just don't have any transformations applied to them beforehand ("analysis" - in Lucene parlance).

As an example:


(Doc 1) "The quick brown fox jumped over the lazy dog"

(Doc 2) "Lazy like the fox"









  1. Simplified postings list created by Standard Analyzer (default for analyzed string fields - tokenized, lowercased, stopwords removed):




"brown": [1]  
"dog": [1]  
"fox": [1,2]  
"jumped": [1]  
"lazy": [1,2]  
"like": [2]  
"over": [1]  
"quick": [1]

34 characters worth of string data



  2. Simplified postings list created by "index": "not_analyzed":




"The quick brown fox jumped over the lazy dog": [1]  
"Lazy like the fox": [2] 

61 characters worth of string data

Analysis causes input to get tokenized and normalized for the purpose of being able to look up documents using a term.

As a result, though, the unit of text is reduced to a normalized term (vs. the entire field value with not_analyzed), and all the redundant (normalized) terms across all documents are collapsed into a single logical list, saving you the space that would otherwise be consumed by repeated terms and stopwords.
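To make the comparison concrete, here is a minimal Python sketch that builds both simplified postings lists for the two example documents and counts the characters stored as terms. It is a toy model, not how Lucene actually stores its term dictionary: the regex tokenizer and the one-word `STOPWORDS` set are assumptions made for illustration, and the function names are hypothetical.

```python
import re
from collections import defaultdict

DOCS = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Lazy like the fox",
}

# Assumption: a tiny illustrative stopword list, not Lucene's real one.
STOPWORDS = {"the"}


def analyzed_postings(docs, stopwords=STOPWORDS):
    """Tokenize, lowercase, and drop stopwords; map each term to doc ids."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for term in re.findall(r"\w+", docs[doc_id].lower()):
            if term in stopwords:
                continue
            if doc_id not in postings[term]:
                postings[term].append(doc_id)
    return dict(postings)


def not_analyzed_postings(docs):
    """Treat each whole field value as a single, unmodified term."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        postings[docs[doc_id]].append(doc_id)
    return dict(postings)


def term_chars(postings):
    """Total number of characters across all distinct terms."""
    return sum(len(term) for term in postings)


if __name__ == "__main__":
    for name, build in (("analyzed", analyzed_postings),
                        ("not_analyzed", not_analyzed_postings)):
        postings = build(DOCS)
        print(name, term_chars(postings), "chars in", len(postings), "terms")
```

The exact counts depend on the stopword list assumed, but the gap widens as more documents share the same words: a repeated word adds only a document id to an existing analyzed term, while every distinct not_analyzed value is stored in full.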
