analyzed or not_analyzed: what to choose

Question

I only use Kibana to search Elasticsearch, and I have several fields that can take only a few values (worst case: servername, with 30 different values).

I understand what analysis does for bigger, more complex fields, but for the small, simple ones I fail to see the advantages and disadvantages of analyzed versus not_analyzed fields.

So what are the benefits of analyzed versus not_analyzed for a field with a limited set of values (for example, servername: server[0-9]*, with no special characters to break on)? What kinds of searches will I lose in Kibana? Will I gain any search speed or disk space?

Testing on one of them, I saw that the .raw version of the field is now empty, but Kibana still flags the field as analyzed, so I find my tests inconclusive.

Recommended answer

I will try to keep it simple; if you need more clarification, just let me know and I'll elaborate a better answer.

An "analyzed" field is broken into tokens by the analyzer defined for that field in your mapping. Since you are referring to values without special characters (say server[1-9]), assume the default analyzer, which basically lowercases the text and breaks it into words on non-alphanumeric characters (that is a description of what it does, not its name). It tokenizes:

this -> HelloWorld123
into -> token1:helloworld123

OR

this -> Hello World 123
into -> token1:hello && token2:world && token3:123

In this case, if you search for HeLlO, the query becomes "hello" and matches this document, because the token "hello" is there.
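
The tokenization described above can be imitated in a few lines of Python. This is only a rough sketch of the behaviour, not Elasticsearch's actual analyzer; the function names `analyze` and `matches` are made up for illustration, and the splitting rule (runs of ASCII letters and digits) is an assumption that happens to fit these examples.

```python
import re


def analyze(text):
    """Rough stand-in for a lowercasing, word-splitting analyzer."""
    # Keep runs of ASCII letters/digits as tokens, then lowercase each one.
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]


def matches(query, field_value):
    """The query matches when every analyzed query token is in the field."""
    field_tokens = set(analyze(field_value))
    return all(tok in field_tokens for tok in analyze(query))


print(analyze("HelloWorld123"))             # ['helloworld123']
print(analyze("Hello World 123"))           # ['hello', 'world', '123']
print(matches("HeLlO", "Hello World 123"))  # True
```

The key point is that the same analysis runs on both the stored value and the query, which is why the casing of the search term does not matter here.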

A not_analyzed field does not go through any tokenizer at all; the whole value is kept as a single token (your keyword). So:

this -> Hello World 123
into -> token1:(Hello World 123)

If you search that field for "hello world 123", it will not match, because the match is case sensitive. You can still use wildcards (Hello*), but let's leave that for another time.
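
To make the contrast concrete, here is a small Python sketch of how a single-token, case-sensitive field behaves. It is an approximation for illustration only: `keyword_match` and `wildcard_match` are hypothetical helpers, and `fnmatchcase` from the standard library stands in for a wildcard query (its pattern syntax is not identical to Elasticsearch's).

```python
from fnmatch import fnmatchcase


def keyword_match(query, field_value):
    """The whole stored value is one token, so only an exact match counts."""
    return query == field_value


def wildcard_match(pattern, field_value):
    """Case-sensitive wildcard match against the single stored token."""
    return fnmatchcase(field_value, pattern)


stored = "Hello World 123"
print(keyword_match("hello world 123", stored))  # False: casing differs
print(keyword_match("Hello World 123", stored))  # True: exact value
print(wildcard_match("Hello*", stored))          # True
print(wildcard_match("hello*", stored))          # False: still case sensitive
```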

In short:

Use "analyzed" fields for fields that you are going to search and that you want Elasticsearch to score. Example: titles that contain the word "jobs". Query: "title:jobs".

doc1 : title:developer jobs in montreal
doc2 : title:java coder jobs in vancouver
doc3 : title:unix designer jobs in toronto
doc4 : title:database manager vacancies in montreal

This retrieves doc1, doc2, and doc3.

In those cases, "analyzed" fields are what you want.
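
The title example can be sketched as a tiny inverted index in Python. This is a simplified illustration, assuming whitespace splitting and lowercasing as the analysis step; it ignores scoring entirely, and `build_index` is a hypothetical helper, not an Elasticsearch API.

```python
from collections import defaultdict

docs = {
    "doc1": "developer jobs in montreal",
    "doc2": "java coder jobs in vancouver",
    "doc3": "unix designer jobs in toronto",
    "doc4": "database manager vacancies in montreal",
}


def build_index(documents):
    """Map each lowercased word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, title in documents.items():
        for token in title.lower().split():
            index[token].add(doc_id)
    return index


title_index = build_index(docs)
print(sorted(title_index["jobs"]))      # ['doc1', 'doc2', 'doc3']
print(sorted(title_index["montreal"]))  # ['doc1', 'doc4']
```

Because each word is its own entry in the index, a search for one word finds every title containing it, which is what makes analyzed fields suitable for free-text search.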

If you know in advance what kind of data the field will hold and you are going to query for exact values, then "not_analyzed" is what you want.

Example:

Get all the logs from server123.

Query: "server:server123".

doc1 :server:server123,log:randomstring,date:01-jan
doc2 :server:server986,log:randomstring,date:01-jan
doc3 :server:server777,log:randomstring,date:01-jan
doc4 :server:server666,log:randomstring,date:01-jan
doc5 :server:server123,log:randomstring,date:02-jan

This returns results only from doc1 and doc5.
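
For reference, this is roughly what such a setup could look like, written as Python dictionaries that serialize to JSON request bodies. It assumes an Elasticsearch version that still uses the `string` type with `"index": "not_analyzed"` (later versions use a `keyword` type instead); the type name `logs` and the `term_filter` helper are hypothetical, and the helper only imitates the exact-match lookup locally.

```python
import json

# Mapping body: "server" is stored as one exact token, "log" stays analyzed.
mapping = {
    "mappings": {
        "logs": {  # hypothetical type name
            "properties": {
                "server": {"type": "string", "index": "not_analyzed"},
                "log": {"type": "string"},
            }
        }
    }
}

# Exact-value lookup on the not_analyzed field.
term_query = {"query": {"term": {"server": "server123"}}}

docs = [
    {"id": "doc1", "server": "server123", "date": "01-jan"},
    {"id": "doc2", "server": "server986", "date": "01-jan"},
    {"id": "doc3", "server": "server777", "date": "01-jan"},
    {"id": "doc4", "server": "server666", "date": "01-jan"},
    {"id": "doc5", "server": "server123", "date": "02-jan"},
]


def term_filter(documents, field, value):
    """Local imitation of a term query: keep documents with the exact value."""
    return [d["id"] for d in documents if d.get(field) == value]


print(json.dumps(mapping, indent=2))
print(json.dumps(term_query))
print(term_filter(docs, "server", "server123"))  # ['doc1', 'doc5']
```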

Well, I hope you get the point. As I said, keep it simple: it comes down to what you need.

analyzed -> more space on disk (a lot more if the analyzed fields are big). analyzed -> more time for indexing. analyzed -> better for matching documents.

not_analyzed -> less space on disk. not_analyzed -> less time for indexing. not_analyzed -> exact match on field values, or wildcard searches.

Regards,

Daniel
