Filename search with ElasticSearch

Question

I want to use ElasticSearch to search filenames (not the file's content). Therefore I need to find a part of the filename (exact match, no fuzzy search).

Example:
I have files with the following names:

My_first_file_created_at_2012.01.13.doc
My_second_file_created_at_2012.01.13.pdf
Another file.txt
And_again_another_file.docx
foo.bar.txt

Now I want to search for 2012.01.13 to get the first two files.
A search for file or ile should return all filenames except the last one.

How can I accomplish that with ElasticSearch?

This is what I have tested, but it always returns zero results:

curl -X DELETE localhost:9200/files
curl -X PUT    localhost:9200/files -d '
{
  "settings" : {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "filename_analyzer" : {
            "type" : "custom",
            "tokenizer" : "lowercase",
            "filter"    : ["filename_stop", "filename_ngram"]
          }
        },
        "filter" : {
          "filename_stop" : {
            "type" : "stop",
            "stopwords" : ["doc", "pdf", "docx"]
          },
          "filename_ngram" : {
            "type" : "nGram",
            "min_gram" : 3,
            "max_gram" : 255
          }
        }
      }
    }
  },

  "mappings": {
    "files": {
      "properties": {
        "filename": {
          "type": "string",
          "analyzer": "filename_analyzer"
        }
      }
    }
  }
}
'

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_first_file_created_at_2012.01.13.doc" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_second_file_created_at_2012.01.13.pdf" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "Another file.txt" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "And_again_another_file.docx" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "foo.bar.txt" }'
curl -X POST "http://localhost:9200/files/_refresh"


FILES='
http://localhost:9200/files/_search?q=filename:2012.01.13
'

for file in ${FILES}
do
  echo; echo; echo ">>> ${file}"
  curl "${file}&pretty=true"
done

Answer

You have various problems with what you pasted:

1) Incorrect mapping

When creating the index, you specify:

"mappings": {
    "files": {

But your type is actually file, not files. If you checked the mapping, you would see that immediately:

curl -XGET 'http://127.0.0.1:9200/files/_mapping?pretty=1' 

# {
#    "files" : {
#       "files" : {
#          "properties" : {
#             "filename" : {
#                "type" : "string",
#                "analyzer" : "filename_analyzer"
#             }
#          }
#       },
#       "file" : {
#          "properties" : {
#             "filename" : {
#                "type" : "string"
#             }
#          }
#       }
#    }
# }
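
If you want to confirm where the documents actually went, a match_all search scoped to the file type should return all five of them, with _type reported as file in every hit (a quick sanity check, assuming the index from the question is still in place):

curl -XGET 'http://localhost:9200/files/file/_search?pretty=1' -d '
{
   "query" : { "match_all" : {} }
}
'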

2) Incorrect analyzer definition

You have specified the lowercase tokenizer, but that removes anything that isn't a letter (see docs), so your numbers are being completely removed.

You can check this with the analyze API:

curl -XGET 'http://127.0.0.1:9200/_analyze?pretty=1&text=My_file_2012.01.13.doc&tokenizer=lowercase' 

# {
#    "tokens" : [
#       {
#          "end_offset" : 2,
#          "position" : 1,
#          "start_offset" : 0,
#          "type" : "word",
#          "token" : "my"
#       },
#       {
#          "end_offset" : 7,
#          "position" : 2,
#          "start_offset" : 3,
#          "type" : "word",
#          "token" : "file"
#       },
#       {
#          "end_offset" : 22,
#          "position" : 3,
#          "start_offset" : 19,
#          "type" : "word",
#          "token" : "doc"
#       }
#    ]
# }
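
You can also run the full filename_analyzer from the question against its own index to see what would actually be indexed. With those settings, my is too short for the ngram filter (min_gram is 3), doc is removed by the stop filter, and the date has already been stripped by the tokenizer, so you should end up with just the ngrams of file (fil, ile, file), which is why the search for 2012.01.13 finds nothing:

curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&text=My_file_2012.01.13.doc&analyzer=filename_analyzer'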

3) Ngrams on search

You include your ngram token filter in both the index analyzer and the search analyzer. That's fine for the index analyzer, because you want the ngrams to be indexed. But when you search, you want to search on the full string, not on each ngram.

For instance, if you index "abcd" with ngrams of length 1 to 4, you will end up with these tokens:

a b c d ab bc cd abc bcd

But if you search on "dcba" (which shouldn't match) and you also analyze your search terms with ngrams, then you are actually searching on:

d c b a dc cb ba dbc cba

So a, b, c and d will match!
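
The usual fix, and the one used in the solution below, is to declare separate index-time and search-time analyzers in the mapping, so that ngrams are generated only at index time. A minimal sketch (the analyzer names here are placeholders):

"filename" : {
   "type"            : "string",
   "index_analyzer"  : "some_ngram_analyzer",
   "search_analyzer" : "some_plain_analyzer"
}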

Solution

First, you need to choose the right analyzer. Your users will probably search for words, numbers or dates, but they probably won't expect ile to match file. Instead, it will probably be more useful to use edge ngrams, which will anchor the ngram to the start (or end) of each word.

Also, why exclude docx etc? Surely a user may well want to search on the file type?

So let's break up each filename into smaller tokens by removing anything that isn't a letter or a number (using the pattern tokenizer):

My_first_file_2012.01.13.doc
=> my first file 2012 01 13 doc

Then for the index analyzer, we'll also use edge ngrams on each of those tokens:

my     => m my
first  => f fi fir firs first
file   => f fi fil file
2012   => 2 20 201 2012
01     => 0 01
13     => 1 13
doc    => d do doc

We create the index as follows:

curl -XPUT 'http://127.0.0.1:9200/files/?pretty=1'  -d '
{
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "filename_search" : {
               "tokenizer" : "filename",
               "filter" : ["lowercase"]
            },
            "filename_index" : {
               "tokenizer" : "filename",
               "filter" : ["lowercase","edge_ngram"]
            }
         },
         "tokenizer" : {
            "filename" : {
               "pattern" : "[^\\p{L}\\d]+",
               "type" : "pattern"
            }
         },
         "filter" : {
            "edge_ngram" : {
               "side" : "front",
               "max_gram" : 20,
               "min_gram" : 1,
               "type" : "edgeNGram"
            }
         }
      }
   },
   "mappings" : {
      "file" : {
         "properties" : {
            "filename" : {
               "type" : "string",
               "search_analyzer" : "filename_search",
               "index_analyzer" : "filename_index"
            }
         }
      }
   }
}
'

Now, test that our analyzers are working correctly:

filename_search:

curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&text=My_first_file_2012.01.13.doc&analyzer=filename_search' 
[results snipped]
"token" : "my"
"token" : "first"
"token" : "file"
"token" : "2012"
"token" : "01"
"token" : "13"
"token" : "doc"

filename_index:

curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&text=My_first_file_2012.01.13.doc&analyzer=filename_index' 
"token" : "m"
"token" : "my"
"token" : "f"
"token" : "fi"
"token" : "fir"
"token" : "firs"
"token" : "first"
"token" : "f"
"token" : "fi"
"token" : "fil"
"token" : "file"
"token" : "2"
"token" : "20"
"token" : "201"
"token" : "2012"
"token" : "0"
"token" : "01"
"token" : "1"
"token" : "13"
"token" : "d"
"token" : "do"
"token" : "doc"

OK - seems to be working correctly. So let's add some docs:

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_first_file_created_at_2012.01.13.doc" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_second_file_created_at_2012.01.13.pdf" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "Another file.txt" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "And_again_another_file.docx" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "foo.bar.txt" }'
curl -X POST "http://localhost:9200/files/_refresh"

And try a search:

curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1'  -d '
{
   "query" : {
      "text" : {
         "filename" : "2012.01"
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "filename" : "My_second_file_created_at_2012.01.13.pdf"
#             },
#             "_score" : 0.06780553,
#             "_index" : "files",
#             "_id" : "PsDvfFCkT4yvJnlguxJrrQ",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_first_file_created_at_2012.01.13.doc"
#             },
#             "_score" : 0.06780553,
#             "_index" : "files",
#             "_id" : "ER5RmyhATg-Eu92XNGRu-w",
#             "_type" : "file"
#          }
#       ],
#       "max_score" : 0.06780553,
#       "total" : 2
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 4
# }

Success!

#### UPDATE ####

I realised that a search for 2012.01 would match both 2012.01.13 and 2012.12.01, so I tried changing the query to use a text phrase query instead. However, this didn't work. It turns out that the edge ngram filter increments the position count for each ngram (while I would have thought that the position of each ngram would be the same as for the start of the word).
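
You can see this behaviour with the analyze API: each edge ngram of 2012.01 should come back at its own, incremented position, rather than the grams of 2012 all sharing one position and the grams of 01 the next (a quick check against the index created above):

curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&text=2012.01&analyzer=filename_index'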

The issue mentioned in point (3) above is only a problem when using a query_string, field, or text query which tries to match ANY token. However, for a text_phrase query, it tries to match ALL of the tokens, and in the correct order.

To demonstrate the issue, index another doc with a different date:

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_third_file_created_at_2012.12.01.doc" }'
curl -X POST "http://localhost:9200/files/_refresh"

And do the same search as above:

curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1'  -d '
{
   "query" : {
      "text" : {
         "filename" : {
            "query" : "2012.01"
         }
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "filename" : "My_third_file_created_at_2012.12.01.doc"
#             },
#             "_score" : 0.22097087,
#             "_index" : "files",
#             "_id" : "xmC51lIhTnWplOHADWJzaQ",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_first_file_created_at_2012.01.13.doc"
#             },
#             "_score" : 0.13137488,
#             "_index" : "files",
#             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_second_file_created_at_2012.01.13.pdf"
#             },
#             "_score" : 0.13137488,
#             "_index" : "files",
#             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
#             "_type" : "file"
#          }
#       ],
#       "max_score" : 0.22097087,
#       "total" : 3
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 5
# }

The first result has a date of 2012.12.01, which isn't the best match for 2012.01. So to match only that exact phrase, we can do:

curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1'  -d '
{
   "query" : {
      "text_phrase" : {
         "filename" : {
            "query" : "2012.01",
            "analyzer" : "filename_index"
         }
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "filename" : "My_first_file_created_at_2012.01.13.doc"
#             },
#             "_score" : 0.55737644,
#             "_index" : "files",
#             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_second_file_created_at_2012.01.13.pdf"
#             },
#             "_score" : 0.55737644,
#             "_index" : "files",
#             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
#             "_type" : "file"
#          }
#       ],
#       "max_score" : 0.55737644,
#       "total" : 2
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 7
# }

Or, if you still want to match all 3 files (because the user might remember some of the words in the filename, but in the wrong order), you can run both queries, but boost the phrase query so that filenames with the terms in the correct order rank higher:

curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1'  -d '
{
   "query" : {
      "bool" : {
         "should" : [
            {
               "text_phrase" : {
                  "filename" : {
                     "boost" : 2,
                     "query" : "2012.01",
                     "analyzer" : "filename_index"
                  }
               }
            },
            {
               "text" : {
                  "filename" : "2012.01"
               }
            }
         ]
      }
   }
}
'

# [Fri Feb 24 16:31:02 2012] Response:
# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "filename" : "My_first_file_created_at_2012.01.13.doc"
#             },
#             "_score" : 0.56892186,
#             "_index" : "files",
#             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_second_file_created_at_2012.01.13.pdf"
#             },
#             "_score" : 0.56892186,
#             "_index" : "files",
#             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_third_file_created_at_2012.12.01.doc"
#             },
#             "_score" : 0.012931341,
#             "_index" : "files",
#             "_id" : "xmC51lIhTnWplOHADWJzaQ",
#             "_type" : "file"
#          }
#       ],
#       "max_score" : 0.56892186,
#       "total" : 3
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 4
# }
