Locality-sensitive hashing - Elasticsearch

Question

Is there any plugin that enables LSH on Elasticsearch? If so, could you point me to it and explain briefly how to use it? Thanks.

I found out that there is a MinHash plugin for ES. How can I compare documents to one another with it? What would be a good setting for finding duplicates?

Recommended answer

  1. There is an Elasticsearch MinHash plugin. You can use it to compute a minhash value each time you index a document, and then query for the document by its minhash value later.

  1. Install the MinHash plugin:

    $ $ES_HOME/bin/plugin install org.codelibs/elasticsearch-minhash/2.3.1

  • Add a minhash analyzer when creating your index:

    $ curl -XPUT 'localhost:9200/my_index' -d '{
      "index":{
        "analysis":{
          "analyzer":{
            "minhash_analyzer":{
              "type":"custom",
              "tokenizer":"standard",
              "filter":["minhash"]
            }
          }
        }
      }
    }'  
    

  • Add a minhash_value field to the index mapping:

    $ curl -XPUT "localhost:9200/my_index/my_type/_mapping" -d '{
      "my_type":{
        "properties":{
          "message":{
            "type":"string",
            "copy_to":"minhash_value"
          },
          "minhash_value":{
            "type":"minhash",
            "minhash_analyzer":"minhash_analyzer"
          }
        }
      }
    }'
    

  • The minhash value is calculated automatically when you add a document to the index you created with the minhash analyzer.
  • a. A More Like This query can be used to run a "like" search on the minhash_value field:

    GET /_search
    {
        "query": {
            "more_like_this" : {
                "fields" : ["minhash_value"],
                "like" : "KV5rsUfZpcZdVojpG8mHLA==",
                "min_term_freq" : 1,
                "max_query_terms" : 12
            }
        }
    }
    

    b. You can also use a fuzzy query, but it only matches values that differ from the query term by an edit distance of at most 2.

    GET /_search
    {
        "query": {
           "fuzzy" : { "minhash_value" : "KV5rsUfZpcZdVojpG8mHLA==" }
        }
    } 
    

    You can find more about the fuzzy query here.
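
    To make the limit of 2 concrete, here is a minimal Python sketch of plain Levenshtein edit distance, the measure that fuzzy matching is based on. The function names and the two altered hash strings are made up for illustration, and Elasticsearch's own matching may additionally count a swap of two adjacent characters as a single edit, so treat this as an approximation of the behaviour rather than the actual implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes,
    or substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def within_fuzzy_limit(a, b, max_edits=2):
    """True if b is at most max_edits edits away from a."""
    return edit_distance(a, b) <= max_edits


if __name__ == "__main__":
    stored = "KV5rsUfZpcZdVojpG8mHLA=="   # value from the queries above
    close = "KV5rsUfZpcZdVojpG8mHLB=="    # hypothetical: one character changed
    far = "KV5rsXXXpcZdVojpG8mHLA=="      # hypothetical: three characters changed
    print(within_fuzzy_limit(stored, close))
    print(within_fuzzy_limit(stored, far))
```

    In other words, a fuzzy query only finds hash strings that are almost identical character by character, which is a much stricter test than document similarity.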

  • Alternatively, you can create the hash value outside of Elasticsearch: write code that computes the hash, run it every time you index a document, and attach the hash value to the document being indexed. You can then search on the hash value with a More Like This query or a fuzzy query as described above.
  • Last but not least, you can write your own Elasticsearch plugin like the one above (using whichever hashing algorithm suits you) and follow the same steps.
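
For the option of computing the hash outside Elasticsearch, the following Python sketch shows one way a MinHash signature could be produced and compared. Everything in it is an assumption made for illustration: the function names, the 3-word shingles, the 128-value signature, and the salted BLAKE2 hashes. It is not the algorithm or the output format of the MinHash plugin.

```python
import hashlib
import re

NUM_HASHES = 128  # signature length; an arbitrary choice for this sketch


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def salted_hash(seed, token):
    """64-bit hash of a token; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_hashes=NUM_HASHES):
    """For each hash function, keep the smallest hash over all shingles."""
    tokens = shingles(text)
    if not tokens:
        raise ValueError("text contains no tokens")
    return [min(salted_hash(seed, t) for t in tokens)
            for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = minhash_signature("the quick brown fox jumps over the lazy dog")
    b = minhash_signature("the quick brown fox jumps over the lazy cat")
    print(estimated_jaccard(a, b))
```

A score close to 1 suggests near-duplicates. The threshold for calling two documents duplicates, and how the signature is serialized into a document field before indexing, depend on the data and are left open here.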