Filename search with ElasticSearch


Problem description

I want to use ElasticSearch to search filenames (not the file's content). Therefore I need to find a part of the filename (exact match, no fuzzy search).

Example:
I have files with the following names:

My_first_file_created_at_2012.01.13.doc
My_second_file_created_at_2012.01.13.pdf
Another file.txt
And_again_another_file.docx
foo.bar.txt

Now I want to search for 2012.01.13 to get the first two files.
A search for file or ile should return all filenames except the last one.

How can I accomplish that with ElasticSearch?

This is what I have tested, but it always returns zero results:

curl -X DELETE localhost:9200/files
curl -X PUT    localhost:9200/files -d '
{
  "settings" : {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "filename_analyzer" : {
            "type" : "custom",
            "tokenizer" : "lowercase",
            "filter"    : ["filename_stop", "filename_ngram"]
          }
        },
        "filter" : {
          "filename_stop" : {
            "type" : "stop",
            "stopwords" : ["doc", "pdf", "docx"]
          },
          "filename_ngram" : {
            "type" : "nGram",
            "min_gram" : 3,
            "max_gram" : 255
          }
        }
      }
    }
  },

  "mappings": {
    "files": {
      "properties": {
        "filename": {
          "type": "string",
          "analyzer": "filename_analyzer"
        }
      }
    }
  }
}
'

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_first_file_created_at_2012.01.13.doc" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_second_file_created_at_2012.01.13.pdf" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "Another file.txt" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "And_again_another_file.docx" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "foo.bar.txt" }'
curl -X POST "http://localhost:9200/files/_refresh"


FILES='
http://localhost:9200/files/_search?q=filename:2012.01.13
'

for file in ${FILES}
do
  echo; echo; echo ">>> ${file}"
  curl "${file}&pretty=true"
done

Answer

You have various problems with what you pasted:

1) Incorrect mapping

When you create the index, you specify:

"mappings": {
    "files": {

But your type is actually file, not files. If you checked the mapping, you would see that immediately:

curl -XGET 'http://127.0.0.1:9200/files/_mapping?pretty=1' 

# {
#    "files" : {
#       "files" : {
#          "properties" : {
#             "filename" : {
#                "type" : "string",
#                "analyzer" : "filename_analyzer"
#             }
#          }
#       },
#       "file" : {
#          "properties" : {
#             "filename" : {
#                "type" : "string"
#             }
#          }
#       }
#    }
# }

2) Incorrect analyzer definition

You have specified the lowercase tokenizer but that removes anything that isn't a letter, (see docs), so your numbers are being completely removed.

You can check this with the analyze API:

curl -XGET 'http://127.0.0.1:9200/_analyze?pretty=1&text=My_file_2012.01.13.doc&tokenizer=lowercase' 

# {
#    "tokens" : [
#       {
#          "end_offset" : 2,
#          "position" : 1,
#          "start_offset" : 0,
#          "type" : "word",
#          "token" : "my"
#       },
#       {
#          "end_offset" : 7,
#          "position" : 2,
#          "start_offset" : 3,
#          "type" : "word",
#          "token" : "file"
#       },
#       {
#          "end_offset" : 22,
#          "position" : 3,
#          "start_offset" : 19,
#          "type" : "word",
#          "token" : "doc"
#       }
#    ]
# }
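The effect can also be sketched in a few lines of Python. This is a simplified ASCII-only model of the lowercase tokenizer's rule ("split on anything that is not a letter, then lowercase"), not the actual Lucene implementation, which handles full Unicode letters:

```python
import re

def lowercase_tokenize(text):
    """Mimic the lowercase tokenizer: emit maximal runs of letters,
    lowercased; digits and punctuation act as separators and are dropped."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

print(lowercase_tokenize("My_file_2012.01.13.doc"))
# → ['my', 'file', 'doc']
```

The date components never make it into the index, so a search for 2012.01.13 cannot match anything.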

3) Ngrams on search

You include your ngram token filter in both the index analyzer and the search analyzer. That's fine for the index analyzer, because you want the ngrams to be indexed. But when you search, you want to search on the full string, not on each ngram.

For instance, if you index "abcd" with ngrams of length 1 to 4, you will end up with these tokens:

a b c d ab bc cd abc bcd

But if you search on "dcba" (which shouldn't match) and you also analyze your search terms with ngrams, then you are actually searching on:

d c b a dc cb ba dcb cba

So a, b, c and d will match!
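The overlap is easy to verify with a small Python sketch of an ngram filter with lengths 1 to 4, matching the example above (an illustration only, not the Lucene filter itself):

```python
def ngrams(token, min_gram=1, max_gram=4):
    """All substrings of token with length between min_gram and max_gram."""
    return {token[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(token) - n + 1)}

indexed = ngrams("abcd")  # grams stored in the index
query = ngrams("dcba")    # grams produced from the search term
print(sorted(indexed & query))
# → ['a', 'b', 'c', 'd']
```

The single-character grams are shared between the two strings, which is why the "wrong" document still matches.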

Solution

First, you need to choose the right analyzer. Your users will probably search for words, numbers or dates, but they probably won't expect ile to match file. Instead, it will probably be more useful to use edge ngrams, which will anchor the ngram to the start (or end) of each word.

Also, why exclude docx etc? Surely a user may well want to search on the file type?

So let's break up each filename into smaller tokens by removing anything that isn't a letter or a number (using the pattern tokenizer):

My_first_file_2012.01.13.doc
=> my first file 2012 01 13 doc

Then for the index analyzer, we'll also use edge ngrams on each of those tokens:

my     => m my
first  => f fi fir firs first
file   => f fi fil file
2012   => 2 20 201 2012
01     => 0 01
13     => 1 13
doc    => d do doc
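As a sanity check, the whole index-time chain (pattern tokenizer, lowercase, front edge ngrams) can be sketched in Python. This is a simplified ASCII-only model of the analysis chain, not Elasticsearch itself:

```python
import re

def edge_ngrams(token, min_gram=1, max_gram=20):
    """Front edge ngrams: prefixes of the token, shortest first."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

def analyze_index(text):
    """Pattern-tokenize (keep runs of letters/digits), lowercase,
    then expand each token into its front edge ngrams."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]
    return [gram for t in tokens for gram in edge_ngrams(t)]

print(analyze_index("My_first_file_2012.01.13.doc"))
# → ['m', 'my', 'f', 'fi', 'fir', 'firs', 'first', ...]
```

The output reproduces the token table above.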

We create the index as follows:

curl -XPUT 'http://127.0.0.1:9200/files/?pretty=1'  -d '
{
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "filename_search" : {
               "tokenizer" : "filename",
               "filter" : ["lowercase"]
            },
            "filename_index" : {
               "tokenizer" : "filename",
               "filter" : ["lowercase","edge_ngram"]
            }
         },
         "tokenizer" : {
            "filename" : {
               "pattern" : "[^\p{L}\d]+",
               "type" : "pattern"
            }
         },
         "filter" : {
            "edge_ngram" : {
               "side" : "front",
               "max_gram" : 20,
               "min_gram" : 1,
               "type" : "edgeNGram"
            }
         }
      }
   },
   "mappings" : {
      "file" : {
         "properties" : {
            "filename" : {
               "type" : "string",
               "search_analyzer" : "filename_search",
               "index_analyzer" : "filename_index"
            }
         }
      }
   }
}
'

Now, test that our analyzers are working correctly:

filename_search:

curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&text=My_first_file_2012.01.13.doc&analyzer=filename_search' 
[results snipped]
"token" : "my"
"token" : "first"
"token" : "file"
"token" : "2012"
"token" : "01"
"token" : "13"
"token" : "doc"

filename_index:

curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&text=My_first_file_2012.01.13.doc&analyzer=filename_index' 
"token" : "m"
"token" : "my"
"token" : "f"
"token" : "fi"
"token" : "fir"
"token" : "firs"
"token" : "first"
"token" : "f"
"token" : "fi"
"token" : "fil"
"token" : "file"
"token" : "2"
"token" : "20"
"token" : "201"
"token" : "2012"
"token" : "0"
"token" : "01"
"token" : "1"
"token" : "13"
"token" : "d"
"token" : "do"
"token" : "doc"

OK - seems to be working correctly. So let's add some docs:

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_first_file_created_at_2012.01.13.doc" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_second_file_created_at_2012.01.13.pdf" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "Another file.txt" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "And_again_another_file.docx" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "foo.bar.txt" }'
curl -X POST "http://localhost:9200/files/_refresh"

And try searching:

curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1'  -d '
{
   "query" : {
      "text" : {
         "filename" : "2012.01"
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "filename" : "My_second_file_created_at_2012.01.13.pdf"
#             },
#             "_score" : 0.06780553,
#             "_index" : "files",
#             "_id" : "PsDvfFCkT4yvJnlguxJrrQ",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_first_file_created_at_2012.01.13.doc"
#             },
#             "_score" : 0.06780553,
#             "_index" : "files",
#             "_id" : "ER5RmyhATg-Eu92XNGRu-w",
#             "_type" : "file"
#          }
#       ],
#       "max_score" : 0.06780553,
#       "total" : 2
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 4
# }

成功!

#### UPDATE ####

I realised that a search for 2012.01 would match both 2012.01.12 and 2012.12.01 so I tried changing the query to use a text phrase query instead. However, this didn't work. It turns out that the edge ngram filter increments the position count for each ngram (while I would have thought that the position of each ngram would be the same as for the start of the word).

The issue mentioned in point (3) above is only a problem when using a query_string, field, or text query which tries to match ANY token. However, for a text_phrase query, it tries to match ALL of the tokens, and in the correct order.
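The distinction can be modelled in Python over plain token lists (hypothetical helper names for illustration; real Elasticsearch matches against indexed term positions, which is exactly what the edge-ngram position issue above interferes with):

```python
def any_match(query_tokens, doc_tokens):
    """text / field / query_string style: match if ANY query token appears."""
    return any(q in doc_tokens for q in query_tokens)

def phrase_match(query_tokens, doc_tokens):
    """text_phrase style: ALL query tokens, adjacent and in order."""
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))

doc_a = ["2012", "01", "13"]  # from ...2012.01.13...
doc_b = ["2012", "12", "01"]  # from ...2012.12.01...
query = ["2012", "01"]

print(any_match(query, doc_a), any_match(query, doc_b))       # → True True
print(phrase_match(query, doc_a), phrase_match(query, doc_b)) # → True False
```

Only the phrase match distinguishes 2012.01.13 from 2012.12.01.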

To demonstrate the issue, index another doc with a different date:

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_third_file_created_at_2012.12.01.doc" }'
curl -X POST "http://localhost:9200/files/_refresh"

And do the same search as above:

curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1'  -d '
{
   "query" : {
      "text" : {
         "filename" : {
            "query" : "2012.01"
         }
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "filename" : "My_third_file_created_at_2012.12.01.doc"
#             },
#             "_score" : 0.22097087,
#             "_index" : "files",
#             "_id" : "xmC51lIhTnWplOHADWJzaQ",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_first_file_created_at_2012.01.13.doc"
#             },
#             "_score" : 0.13137488,
#             "_index" : "files",
#             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_second_file_created_at_2012.01.13.pdf"
#             },
#             "_score" : 0.13137488,
#             "_index" : "files",
#             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
#             "_type" : "file"
#          }
#       ],
#       "max_score" : 0.22097087,
#       "total" : 3
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 5
# }

The first result has a date 2012.12.01 which isn't the best match for 2012.01. So to match only that exact phrase, we can do:

curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1'  -d '
{
   "query" : {
      "text_phrase" : {
         "filename" : {
            "query" : "2012.01",
            "analyzer" : "filename_index"
         }
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "filename" : "My_first_file_created_at_2012.01.13.doc"
#             },
#             "_score" : 0.55737644,
#             "_index" : "files",
#             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_second_file_created_at_2012.01.13.pdf"
#             },
#             "_score" : 0.55737644,
#             "_index" : "files",
#             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
#             "_type" : "file"
#          }
#       ],
#       "max_score" : 0.55737644,
#       "total" : 2
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 7
# }

Or, if you still want to match all 3 files (because the user might remember some of the words in the filename, but in the wrong order), you can run both queries but increase the importance of the filename which is in the correct order:

curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1'  -d '
{
   "query" : {
      "bool" : {
         "should" : [
            {
               "text_phrase" : {
                  "filename" : {
                     "boost" : 2,
                     "query" : "2012.01",
                     "analyzer" : "filename_index"
                  }
               }
            },
            {
               "text" : {
                  "filename" : "2012.01"
               }
            }
         ]
      }
   }
}
'

# [Fri Feb 24 16:31:02 2012] Response:
# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "filename" : "My_first_file_created_at_2012.01.13.doc"
#             },
#             "_score" : 0.56892186,
#             "_index" : "files",
#             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_second_file_created_at_2012.01.13.pdf"
#             },
#             "_score" : 0.56892186,
#             "_index" : "files",
#             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_third_file_created_at_2012.12.01.doc"
#             },
#             "_score" : 0.012931341,
#             "_index" : "files",
#             "_id" : "xmC51lIhTnWplOHADWJzaQ",
#             "_type" : "file"
#          }
#       ],
#       "max_score" : 0.56892186,
#       "total" : 3
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 4
# }
