Comparison of Lucene Analyzers

Question

Can someone please explain the difference between the different analyzers within Lucene? I am getting a maxClauseCount exception and I understand that I can avoid this by using a KeywordAnalyzer but I don't want to change from the StandardAnalyzer without understanding the issues surrounding analyzers. Thanks very much.

Answer

In general, any analyzer in Lucene is a tokenizer + stemmer + stop-words filter.

The tokenizer splits your text into chunks, and since different analyzers may use different tokenizers, you can get different output token streams, i.e. different sequences of chunks of text. For example, the KeywordAnalyzer you mentioned doesn't split the text at all and takes the whole field as a single token. At the same time, StandardAnalyzer (and most other analyzers) uses spaces and punctuation as split points. For example, for the phrase "I am very happy" it will produce the list ["i", "am", "very", "happy"] (or something like that). For more information on specific analyzers/tokenizers, see their Java Docs.
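If you want to see the difference for yourself, a small sketch like the following prints the tokens each analyzer emits. This assumes a reasonably recent Lucene (5.x or later, where StandardAnalyzer has a no-argument constructor); the field name "field" is just a placeholder.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class AnalyzerDemo {

    // Run the given analyzer over some text and collect the emitted tokens.
    static List<String> tokensOf(Analyzer analyzer, String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (TokenStream stream = analyzer.tokenStream("field", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        String text = "I am very happy";
        // StandardAnalyzer splits on whitespace/punctuation and lower-cases:
        // roughly [i, am, very, happy]
        System.out.println(tokensOf(new StandardAnalyzer(), text));
        // KeywordAnalyzer emits the whole field as one token: [I am very happy]
        System.out.println(tokensOf(new KeywordAnalyzer(), text));
    }
}
```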

Stemmers are used to get the base form of a word. The result heavily depends on the language used. For example, for the previous phrase in English something like ["i", "be", "veri", "happi"] will be produced, and for the French "Je suis très heureux" some kind of French analyzer (like SnowballAnalyzer initialized with "French") will produce ["je", "être", "tre", "heur"]. Of course, if you use an analyzer for one language to stem text in another, the rules of the wrong language will be applied and the stemmer may produce incorrect results. The whole system won't fail, but search results may then be less accurate.
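Note that SnowballAnalyzer was deprecated and later removed in newer Lucene releases; the per-language analyzers such as EnglishAnalyzer and FrenchAnalyzer now play the same role (tokenize, lower-case, drop stop words, stem). Reusing the tokensOf() helper from the sketch above, a hedged illustration follows; the exact output depends on the Lucene version and its default stop/stem rules.

```java
import java.io.IOException;

import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;

public class StemmingDemo {
    public static void main(String[] args) throws IOException {
        // EnglishAnalyzer lower-cases, drops English stop words and stems:
        // "I am very happy" comes out roughly as [i, am, veri, happi]
        System.out.println(AnalyzerDemo.tokensOf(new EnglishAnalyzer(), "I am very happy"));

        // FrenchAnalyzer uses French stop words and a French stemmer instead.
        // Feeding it text in the wrong language gives poorer, but not fatal, results.
        System.out.println(AnalyzerDemo.tokensOf(new FrenchAnalyzer(), "Je suis très heureux"));
    }
}
```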

KeywordAnalyzer doesn't use any stemmer; it passes the whole field through unmodified. So, if you are going to search for individual words in English text, it isn't a good idea to use this analyzer.

Stop words are the most frequent and nearly useless words. Again, which words count as stop words heavily depends on the language. For English these are words like "a", "the", "I", "be", "have", etc. Stop-word filters remove them from the token stream to lower the noise in search results, so in the end our phrase "I am very happy" will come out as something like ["veri", "happi"] after stop-word removal plus stemming (StandardAnalyzer itself only tokenizes, lower-cases and filters stop words; it does not stem).
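Also note that the stop-word set is configurable and its default changed across Lucene versions: recent StandardAnalyzer versions ship with an empty default stop set, so you may have to supply one explicitly. A hedged sketch, assuming Lucene 5+ APIs and reusing the tokensOf() helper from above:

```java
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class StopWordsDemo {
    public static void main(String[] args) throws Exception {
        // In recent Lucene versions StandardAnalyzer's no-arg constructor uses an
        // EMPTY stop set, so pass one explicitly if you want "the", "a", ... removed.
        StandardAnalyzer withStops = new StandardAnalyzer(EnglishAnalyzer.getDefaultStopSet());
        System.out.println(AnalyzerDemo.tokensOf(withStops, "I am very happy, the weather is nice"));
        // roughly: [i, am, very, happy, weather, nice] — stop words dropped, no stemming
    }
}
```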

And KeywordAnalyzer, again, does nothing here either. So KeywordAnalyzer is used for things like IDs or phone numbers, not for normal text.

As for your maxClauseCount exception, I believe you get it while searching. In that case it is most probably caused by a search query that is too complex (for example, one that rewrites into very many boolean clauses). Try to split it into several queries or use more low-level functions.
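For reference, this exception usually surfaces as BooleanQuery.TooManyClauses when a query (often a wildcard, prefix, or range query after rewriting) expands into more clauses than the configured limit. In older Lucene versions the limit can be inspected or raised through a static setter, though restructuring the query is usually the better fix; a hedged sketch using the pre-8.x API:

```java
import org.apache.lucene.search.BooleanQuery;

public class MaxClauseDemo {
    public static void main(String[] args) {
        // Default limit is 1024 clauses; queries that rewrite into more term
        // clauses than this throw BooleanQuery.TooManyClauses.
        System.out.println("current limit: " + BooleanQuery.getMaxClauseCount());

        // Raising the limit is possible (pre-8.x static setter shown), but it trades
        // memory and search time for convenience; splitting the query is usually better.
        BooleanQuery.setMaxClauseCount(4096);
    }
}
```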
