Elasticsearch "More Like This" API vs. more_like_this query

Question

Elasticsearch has two similar features to get "similar" documents:

There is the "More Like This API". It gives me documents similar to a given one. I can't use it in more complex expressions though.

There is also the "more_like_this" query for use in the Search API. I can use it in bool or boosting expressions, but I can't give it the id of a document; I have to provide the "like_text" parameter.

I have documents with tags and content. Some documents will have good tags and some won't have any. I want a "Similar documents" feature that will work every time but will rank documents with matching tags higher than documents with matching text. My idea was:

{
    "boosting" : {
        "positive" : {
            "more_like_this" : {
                "fields" : ["tag"],
                "id" : "23452",
                "min_term_freq" : 1
            }
        },
        "negative" : {
            "more_like_this" : {
                "fields" : ["tag"],
                "id" : "23452"
            }
        },
        "negative_boost" : 0.2
    }
}

Obviously this doesn't work because there is no "id" in "more_like_this". What are the alternatives?

Recommended answer

First of all, a little introduction to the more like this functionality and how it works. The idea is that you have a specific document and you want to find others that are similar to it.

In order to achieve this, we need to extract some content from the current document and use it to build a query that retrieves similar ones. We can extract content from the Lucene stored fields (or the Elasticsearch _source field, which is effectively a stored field in Lucene) and reanalyze it, or use the information stored in the term vectors (if enabled at index time) to get a list of terms to query with, without having to reanalyze the text. I'm not sure whether Elasticsearch tries this latter approach when term vectors are available, though.

The more like this query allows you to provide a text, regardless of where you got it from. That text is used to query the fields you select and get back similar documents. The text is not used as is: it is reanalyzed, and at most max_query_terms (default 25) terms are kept, chosen from the terms that have at least the provided min_term_freq (minimum term frequency, default 2) and a document frequency between min_doc_freq and max_doc_freq. There are more parameters that can influence the generated query.
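
The filtering described above can be illustrated with a small sketch. It is a simplification and not the Lucene implementation: the tokenizer is a stand-in for the field's analyzer, the function name select_query_terms and the doc_freq mapping are hypothetical, the default min_doc_freq=1 is chosen for the sketch rather than taken from Elasticsearch, and terms are ranked by raw term frequency only, whereas the real ranking may also weigh how rare each term is in the index.

```python
from collections import Counter
import re


def select_query_terms(like_text, doc_freq, min_term_freq=2,
                       min_doc_freq=1, max_doc_freq=None, max_query_terms=25):
    """Simplified illustration of how like_text is reduced to query terms."""
    # Stand-in for the analyzer: lowercase and split on non-word characters.
    tokens = re.findall(r"\w+", like_text.lower())
    term_freq = Counter(tokens)

    candidates = []
    for term, tf in term_freq.items():
        df = doc_freq.get(term, 0)  # number of indexed documents containing the term
        if tf < min_term_freq:
            continue  # too rare inside the provided text
        if df < min_doc_freq:
            continue  # too rare in the index
        if max_doc_freq is not None and df > max_doc_freq:
            continue  # too common in the index
        candidates.append((term, tf))

    # Most frequent terms first; ties broken alphabetically to stay deterministic.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in candidates[:max_query_terms]]
```

With the default min_term_freq of 2, a term that occurs only once in the provided text is dropped, which matters for short fields such as tags.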

The more like this API goes one step further, allowing you to provide the id of a document and, again, a list of fields. The content of those fields is extracted from that specific document and used to run a more like this query on the same fields. That means the generated more like this query has its like_text property set to the previously extracted text and is executed on the same fields. As you can see, the more like this API executes a more like this query under the hood.
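
As a rough sketch of that idea, and not of how Elasticsearch implements it internally, the same steps can be done on the client side: take the source of an already fetched document, pull out the text of the chosen fields, and wrap it in a more_like_this query. The function name build_more_like_this_query is hypothetical, and concatenating the field values into a single like_text is an assumption made for the example.

```python
def build_more_like_this_query(source, fields, **params):
    """Build a more_like_this query body from an already fetched document source."""
    parts = []
    for field in fields:
        value = source.get(field)
        if value is None:
            continue  # the document has no value for this field
        if isinstance(value, (list, tuple)):
            # Multi-valued fields (for example tags) contribute every value.
            parts.extend(str(item) for item in value)
        else:
            parts.append(str(value))

    query = {"fields": list(fields), "like_text": " ".join(parts)}
    # Extra settings such as min_term_freq are passed through unchanged.
    query.update(params)
    return {"more_like_this": query}
```

The resulting dict can be serialized with json.dumps and used as the query part of a search request body.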

In short, the more like this query gives you more flexibility, since you can combine it with other queries and take the text from whatever source you like. The more like this API, on the other hand, exposes the common functionality and does more of the work for you, but with some restrictions.

In your case I would combine a couple of different more like this queries, so that you can make use of the powerful Elasticsearch query DSL, boost the queries differently, and so on. The downside is that you have to provide the text yourself, since you can't provide the id of the document to extract it from.

There are different ways to achieve what you want. I would use a bool query to combine the two more like this queries in a should clause and give them different weights. I would also use the more like this field query instead, since you want to query a single field at a time.

{
    "bool" : {
        "must" : {
            "match_all" : { }
        },
        "should" : [
            {
              "more_like_this_field" : {
                "tags" : {
                  "like_text" : "here go the tags extracted from the current document!",
                  "boost" : 2.0
                }
              }
            },
            {
              "more_like_this_field" : {
                "content" : {
                  "like_text" : "here goes the content extracted from the current document!"
                }
              }
            }
        ],
        "minimum_number_should_match" : 1
    }
}

This way at least one of the should clauses must match, and a match on tags is more important than a match on content.
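
Since the like_text values have to be filled in by the caller, a small helper can build this query from the current document. The sketch below assumes the document source has already been fetched (for example with a get request by id) and that it uses the field names tags and content from the query above; the function name build_similar_documents_query and the tag_boost parameter are hypothetical.

```python
def build_similar_documents_query(source, tag_boost=2.0):
    """Build the bool query above from a fetched document source."""
    tags = source.get("tags") or []
    if isinstance(tags, str):
        tags = [tags]  # tolerate a single tag stored as a plain string
    content = source.get("content") or ""

    should = []
    if tags:
        # Tag matches are boosted so they rank above plain content matches.
        should.append({
            "more_like_this_field": {
                "tags": {"like_text": " ".join(tags), "boost": tag_boost}
            }
        })
    if content:
        should.append({
            "more_like_this_field": {
                "content": {"like_text": content}
            }
        })

    return {
        "bool": {
            "must": {"match_all": {}},
            "should": should,
            "minimum_number_should_match": 1,
        }
    }
```

A document without tags produces only the content clause, so the query still returns results for documents that were never tagged. If a document has neither tags nor content, the should list is empty and the caller would need to handle that case separately.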
