analyzed vs not_analyzed: storage size
Problem description
I recently started using Elasticsearch 2. As I understand analyzed vs not_analyzed in the mapping, not_analyzed should take less storage (https://www.elastic.co/blog/elasticsearch-storage-the-true-story-2.0 and https://www.elastic.co/blog/elasticsearch-storage-the-true-story). For testing purposes I created some indexes with all the String fields analyzed (the default), and then some other indexes with all the fields not_analyzed. My surprise came when I checked the size of the indexes and saw that the ones with not_analyzed Strings were 40% bigger! I was inserting the same documents into each index (35,000 docs).
Any idea why this is happening? My documents are simple JSON documents. Each document has 60 String fields that I want to set as not_analyzed, and I tried both setting each field to not_analyzed individually and creating a dynamic template.
Edit: adding the mapping, although I don't think there is anything special about it:
{
  "mappings": {
    "my_type": {
      "_ttl": { "enabled": true, "default": "7d" },
      "properties": {
        "field1": {
          "properties": {
            "field2": {
              "type": "string", "index": "not_analyzed"
            },
            ... more not_analyzed String fields here ...
          }
        }
      }
    }
  }
}
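For comparison, the dynamic-template approach mentioned above could look like this in Elasticsearch 2.x (a sketch; the template name `strings_as_not_analyzed` is illustrative, not from the original post):

```json
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "strings_as_not_analyzed": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      ]
    }
  }
}
```

This maps every dynamically added string field as not_analyzed without listing all 60 fields explicitly.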
Answer
not_analyzed fields are still indexed. They just don't have any transformations applied to them beforehand ("analysis", in Lucene parlance).
As an example:
(Doc 1) "The quick brown fox jumped over the lazy dog"
(Doc 2) "Lazy like the fox"
- Simplified postings list created by the Standard Analyzer (the default for analyzed string fields - tokenized, lowercased, stopwords removed):
"brown": [1]
"dog": [1]
"fox": [1,2]
"jumped": [1]
"lazy": [1,2]
"over": [1]
"quick": [1]
30 characters of string data
- Simplified postings list created by "index": "not_analyzed":
"The quick brown fox jumped over the lazy dog": [1]
"Lazy like the fox": [2]
61 characters of string data
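The character counts above can be checked with a small sketch that mimics the two indexing modes. This is an illustration only: real Lucene postings are compressed binary structures, and the stopword list here is abridged to match the simplified example.

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Lazy like the fox",
}

# Abridged stopword list matching the simplified example above;
# the real Standard Analyzer ships a longer English list.
STOPWORDS = {"the", "like"}

def analyzed_terms(text):
    """Tokenize on whitespace, lowercase, drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

# analyzed: postings keyed by each unique normalized token
analyzed_postings = defaultdict(list)
for doc_id, text in docs.items():
    for term in sorted(set(analyzed_terms(text))):
        analyzed_postings[term].append(doc_id)

# not_analyzed: the entire field value is a single term
not_analyzed_postings = {text: [doc_id] for doc_id, text in docs.items()}

analyzed_chars = sum(len(t) for t in analyzed_postings)
not_analyzed_chars = sum(len(t) for t in not_analyzed_postings)
print(analyzed_chars)      # 30
print(not_analyzed_chars)  # 61
```

With only two short documents the difference is small, but the analyzed postings already reuse "fox" and "lazy" across both documents instead of storing them twice.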
Analysis causes input to get tokenized and normalized for the purpose of being able to look up documents using a term.
But as a result, the unit of text is reduced to a normalized term (versus the entire field value with not_analyzed), and all the redundant (normalized) terms across all documents are collapsed into a single logical list, saving you all the space that would normally be consumed by repeated terms and stopwords.
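This also suggests why the not_analyzed indexes in the question could come out larger: when field values are mostly distinct, every full value becomes its own dictionary term, while analyzed fields share a small vocabulary of tokens across documents. A rough sketch with made-up data (term-dictionary characters only; real index size also depends on norms, doc values, and Lucene's compression):

```python
# Hypothetical documents built from a small repeated vocabulary,
# so every full field value is distinct but tokens repeat heavily.
colors = ["red", "green", "blue", "amber", "violet"]
animals = ["fox", "dog", "owl", "hare", "lynx"]
places = ["river", "forest", "meadow", "ridge", "valley"]
docs = [f"{c} {a} near the {p}" for c in colors for a in animals for p in places]

# analyzed: unique tokens, shared across all documents
# (stopword removal skipped for brevity)
analyzed_terms = {tok for d in docs for tok in d.split()}

# not_analyzed: each distinct full field value is its own term
not_analyzed_terms = set(docs)

analyzed_chars = sum(len(t) for t in analyzed_terms)
not_analyzed_chars = sum(len(t) for t in not_analyzed_terms)
print(len(docs), len(analyzed_terms), len(not_analyzed_terms))
print(analyzed_chars, not_analyzed_chars)
```

Here 125 distinct documents collapse into just 17 shared analyzed terms, while the not_analyzed term dictionary keeps all 125 full values. Multiply that by 60 fields and 35,000 documents and the difference can dominate.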