Elasticsearch "More Like This" API vs. more_like_this query

Question

Elasticsearch has two similar features to get "similar" documents:

There is the "More Like This API". It gives me documents similar to a given one. I can't use it in more complex expressions though.

There is also the "more_like_this" query for use in the Search API. I can use it in bool or boosting expressions, but I can't give it the id of a document; I have to provide the "like_text" parameter.

I have documents with tags and content. Some documents will have good tags and some won't have any. I want a "Similar documents" feature that will work every time but will rank documents with matching tags higher than documents with matching text. My idea was:

{
    "boosting" : {
        "positive" : {
            "more_like_this" : {
                "fields" : ["tag"],
                "id" : "23452",
                "min_term_freq" : 1
            }
        },
        "negative" : {
            "more_like_this" : {
                "fields" : ["tag"],
                "id" : "23452"
            }
        },
        "negative_boost" : 0.2
    }
}

Obviously this doesn't work because there is no "id" in "more_like_this". What are the alternatives?

Recommended answer

First of all a little introduction about the more like this functionality and how it works. The idea is that you have a specific document and you want to have some others that are similar to it.

To achieve this we need to extract some content from the current document and use it to build a query that finds similar ones. We can extract the content from the Lucene stored fields (or the Elasticsearch _source field, which is effectively a stored field in Lucene) and reanalyze it, or we can use the information stored in the term vectors (if enabled at index time) to get a list of terms to query with, without having to reanalyze the text. I'm not sure whether Elasticsearch tries this latter approach when term vectors are available, though.

The more like this query allows you to provide a text, regardless of where you got it from. That text will be used to query the fields that you select and get back similar documents. The text will not be entirely used, but reanalyzed, and only a maximum of max_query_terms (default 25) will be kept, out of the terms that have at least the provided min_term_freq (minimum term frequency, default 2) and document frequency between min_doc_freq and max_doc_freq. There are more parameters too that can influence the generated query.
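
To make the term selection described above a bit more concrete, here is a rough Python sketch. It is not how Elasticsearch or Lucene actually implement it: the `select_query_terms` function and the `doc_freq` dict (a stand-in for the index's document-frequency statistics) are made up for illustration, the tokenizer is a plain regex rather than a real analyzer, and ranking candidates by raw term frequency is a simplification. Only the `min_term_freq` default of 2 and the `max_query_terms` default of 25 come from the text above; the other defaults are arbitrary.

```python
import re
from collections import Counter


def select_query_terms(text, doc_freq, min_term_freq=2, min_doc_freq=1,
                       max_doc_freq=None, max_query_terms=25):
    """Pick the terms of `text` that would end up in the generated query.

    `doc_freq` is a hypothetical dict mapping a term to the number of
    documents containing it (a stand-in for real index statistics).
    """
    # Stand-in for "reanalyzing" the text: lowercase and split on word chars.
    tokens = re.findall(r"\w+", text.lower())
    term_freq = Counter(tokens)

    candidates = []
    for term, tf in term_freq.items():
        df = doc_freq.get(term, 0)
        if tf < min_term_freq:
            continue  # term does not occur often enough in the text
        if df < min_doc_freq:
            continue  # term is too rare in the index
        if max_doc_freq is not None and df > max_doc_freq:
            continue  # term is too common in the index
        candidates.append((term, tf))

    # Simplification: rank by term frequency, ties broken alphabetically.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [term for term, _ in candidates[:max_query_terms]]
```

For example, with the text "search search engine engine engine index", "index" is dropped because it occurs only once, and passing `max_doc_freq` additionally drops any term that appears in too many documents.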

The more like this api goes one step further, allowing you to provide the id of a document and, again, a list of fields. The content of those fields will be extracted from that specific document and used to make a more like this query on the same fields. That means that the generated more like this query will have the property text containing the text previously extracted and will be performed on the same fields. As you can see, the more like this api executes a more like this query under the hood.

Let's say the more like this query gives you more flexibility, since you can combine it with other queries and you can get the text from whatever source you like. On the other hand the more like this api exposes the common functionality doing some more work for you but with some restrictions.

In your case I would combine a couple of different more like this queries together, so that you can make use of the powerful elasticsearch query DSL, boost queries differently and so on. The downside is that you have to provide the text yourself, since you can't provide the id of the document to extract it from.

There are different ways to achieve what you want. I would use a bool query to combine the two more like this queries in a should clause and give them a different weight. I would also use the more like this field query instead, since you want to query a single field at a time.

{
    "bool" : {
        "must" : {
          "match_all" : { }
        },
        "should" : [
            {
              "more_like_this_field" : {
                "tags" : {
                  "like_text" : "here go the tags extracted from the current document!",
                  "boost" : 2.0
                }
              }
            },
            {
              "more_like_this_field" : {
                "content" : {
                  "like_text" : "here goes the content extracted from the current document!"
                }
              }
            }
        ],
        "minimum_number_should_match" : 1
    }
}

This way at least one of the should clauses must match, and a match on tags is more important than a match on content.
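
Since you have to supply the text yourself, one way to do it is to fetch the current document first and build the query body from its fields. Below is a minimal Python sketch of that second step only. It assumes you already have the document's `_source` as a dict (how you fetch it by id is up to your client), and the `build_similar_query` helper and its parameter names are hypothetical. The field names `tags` and `content`, the boost of 2.0, and the query structure are taken from the query above.

```python
def build_similar_query(source, tags_field="tags", content_field="content",
                        tags_boost=2.0):
    """Build the bool query body from a document's _source dict.

    `source` is assumed to be the already-fetched _source of the
    current document; fields that are missing or empty are skipped.
    """
    should = []

    tags = source.get(tags_field)
    if tags:
        # Tags may be stored as a list or as a single string.
        if isinstance(tags, (list, tuple)):
            tags_text = " ".join(str(tag) for tag in tags)
        else:
            tags_text = str(tags)
        should.append({
            "more_like_this_field": {
                tags_field: {"like_text": tags_text, "boost": tags_boost}
            }
        })

    content = source.get(content_field)
    if content:
        should.append({
            "more_like_this_field": {
                content_field: {"like_text": str(content)}
            }
        })

    return {
        "bool": {
            "must": {"match_all": {}},
            "should": should,
            "minimum_number_should_match": 1,
        }
    }
```

A document without tags then simply produces a query with only the content clause, which covers the "some documents won't have any tags" case. Note that if both fields are empty the should list is empty and nothing can match, so you may want to handle that case separately.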
