How to search for a part of a word with ElasticSearch
Problem description
I've recently started using ElasticSearch and I can't seem to make it search for a part of a word.
Example: I have three documents from my couchdb indexed in ElasticSearch:
{
  "_id": "1",
  "name": "John Doeman",
  "function": "Janitor"
}
{
  "_id": "2",
  "name": "Jane Doewoman",
  "function": "Teacher"
}
{
  "_id": "3",
  "name": "Jimmy Jackal",
  "function": "Student"
}
So now, I want to search for all documents containing "Doe":
curl http://localhost:9200/my_idx/my_type/_search?q=Doe
That doesn't return any hits. But if I search for
curl http://localhost:9200/my_idx/my_type/_search?q=Doeman
It does return one document (John Doeman).
I've tried setting different analyzers and different filters as properties of my index. I've also tried using a full-blown query (for example:
{
  "query": {
    "term": {
      "name": "Doe"
    }
  }
}
), but nothing seems to work.
How can I make ElasticSearch find both John Doeman and Jane Doewoman when I search for "Doe"?
UPDATE
I tried to use the nGram tokenizer and filter, as Igor proposed, like this:
{
  "index": {
    "index": "my_idx",
    "type": "my_type",
    "bulk_size": "100",
    "bulk_timeout": "10ms",
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "filter": [
            "my_ngram_filter"
          ]
        }
      },
      "filter": {
        "my_ngram_filter": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 1
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 1
        }
      }
    }
  }
}
The problem I'm having now is that each and every query returns ALL documents. Any pointers? The ElasticSearch documentation on using nGram isn't great...
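A quick way to see why this happens is to run the analyzer by hand with the _analyze API (a diagnostic sketch; the query-string form below is the old syntax, and newer ElasticSearch versions expect a JSON body instead). With min_gram and max_gram both set to 1, the text is chopped into single-character tokens, so nearly any query term shares a token with every document:

# Inspect what my_analyzer emits for a sample string
# (old query-string form; newer ElasticSearch versions take a JSON body)
curl "http://localhost:9200/my_idx/_analyze?analyzer=my_analyzer" -d "John Doeman"
# With min_gram = max_gram = 1 this returns one token per character
# ("J", "o", "h", "n", ...), so any query that shares a single character
# with a document matches it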
Recommended answer
I'm using nGram, too. I use the standard tokenizer and nGram just as a filter. Here is my setup:
{
  "index": {
    "index": "my_idx",
    "type": "my_type",
    "analysis": {
      "index_analyzer": {
        "my_index_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "mynGram"
          ]
        }
      },
      "search_analyzer": {
        "my_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "mynGram"
          ]
        }
      },
      "filter": {
        "mynGram": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 50
        }
      }
    }
  }
}
This lets you find word parts up to 50 letters long. Adjust max_gram as needed. German words can get really long, so I set it to a high value.
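Note that the analyzer definitions only take effect once the field's mapping actually points at them. Here is a minimal sketch of putting it all together, assuming the old pre-2.0 syntax ("string" fields with index_analyzer/search_analyzer; newer versions use "text" fields and "analyzer"). It also uses the plain standard analyzer at search time instead of my_search_analyzer above, so the query itself is not n-grammed:

# Create the index with the nGram analysis settings and a mapping that
# applies them to the "name" field (pre-2.0 mapping syntax assumed)
curl -XPUT "http://localhost:9200/my_idx" -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_index_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "mynGram"]
        }
      },
      "filter": {
        "mynGram": { "type": "nGram", "min_gram": 2, "max_gram": 50 }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "string",
          "index_analyzer": "my_index_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}'

# A partial-word search should now hit both "John Doeman" and "Jane Doewoman"
curl "http://localhost:9200/my_idx/my_type/_search?q=name:doe"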