使用弹性搜索从文本中提取关键字(多词) [英] Extract keywords (multi word) from text using elastic search

查看:31
本文介绍了使用弹性搜索从文本中提取关键字(多词)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个充满关键字的索引,我想根据这些关键字从输入文本中提取关键字.

I have an index full of keywords and based on those keywords I want to extract the keywords from the input text.

以下是示例关键字索引.请注意,关键字也可以是多个词,或者基本上是唯一的标签.

Following is the sample keyword index. Please note that the keywords can be of multiple words too, or basically they are tags which are unique.

{
  "hits": {
    "total": 2000,
    "hits": [
      {
        "id": 1,
        "keyword": "thousand eyes"
      },
      {
        "id": 2,
        "keyword": "facebook"
      },
      {
        "id": 3,
        "keyword": "superdoc"
      },
      {
        "id": 4,
        "keyword": "quora"
      },
      {
        "id": 5,
        "keyword": "your story"
      },
      {
        "id": 6,
        "keyword": "Surgery"
      },
      {
        "id": 7,
        "keyword": "lending club"
      },
      {
        "id": 8,
        "keyword": "ad roll"
      },
      {
        "id": 9,
        "keyword": "the honest company"
      },
      {
        "id": 10,
        "keyword": "Draft kings"
      }
    ]
  }
}

现在,如果我输入文本为我在 facebook 上看到了借贷俱乐部的消息,你的故事和 quora" 搜索的输出应该是 [借贷俱乐部",facebook"、你的故事"、quora"].此外,搜索应该不区分大小写

Now, if I input the text as "I saw the news of lending club on facebook, your story and quora" the output of the search should be ["lending club", "facebook", "your story", "quora"]. Also the search should be case insensetive

推荐答案

只有一种真正的方法可以做到这一点.您必须将您的数据作为关键字编制索引,并使用带状疱疹对其进行搜索:

There's just one real way to do this. You'll have to index your your data as keywords and search it analyzed with shingles:

查看此复制品:

首先,我们将创建两个自定义分析器:关键字和带状疱疹:

First, we'll create two custom analyzers: keyword and shingles:

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        },
        "my_analyzer_shingle": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "asciifolding",
            "lowercase",
            "shingle"
          ]
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "keyword": {
          "type": "string",
          "index_analyzer": "my_analyzer_keyword",
          "search_analyzer": "my_analyzer_shingle"
        }
      }
    }
  }
}

现在让我们使用您提供的数据创建一些示例数据:

Now let's create some sample data using what you gave us:

POST /test/your_type/1
{
  "id": 1,
  "keyword": "thousand eyes"
}
POST /test/your_type/2
{
  "id": 2,
  "keyword": "facebook"
}
POST /test/your_type/3
{
  "id": 3,
  "keyword": "superdoc"
}
POST /test/your_type/4
{
  "id": 4,
  "keyword": "quora"
}
POST /test/your_type/5
{
  "id": 5,
  "keyword": "your story"
}
POST /test/your_type/6
{
  "id": 6,
  "keyword": "Surgery"
}
POST /test/your_type/7
{
  "id": 7,
  "keyword": "lending club"
}
POST /test/your_type/8
{
  "id": 8,
  "keyword": "ad roll"
}
POST /test/your_type/9
{
  "id": 9,
  "keyword": "the honest company"
}
POST /test/your_type/10
{
  "id": 10,
  "keyword": "Draft kings"
}

最后查询运行搜索:

POST /test/your_type/_search
{
  "query": {
    "match": {
      "keyword": "I saw the news of lending club on facebook, your story and quora"
    }
  }
}

这是结果:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0.009332742,
    "hits": [
      {
        "_index": "test",
        "_type": "your_type",
        "_id": "2",
        "_score": 0.009332742,
        "_source": {
          "id": 2,
          "keyword": "facebook"
        }
      },
      {
        "_index": "test",
        "_type": "your_type",
        "_id": "7",
        "_score": 0.009332742,
        "_source": {
          "id": 7,
          "keyword": "lending club"
        }
      },
      {
        "_index": "test",
        "_type": "your_type",
        "_id": "4",
        "_score": 0.009207102,
        "_source": {
          "id": 4,
          "keyword": "quora"
        }
      },
      {
        "_index": "test",
        "_type": "your_type",
        "_id": "5",
        "_score": 0.0014755741,
        "_source": {
          "id": 5,
          "keyword": "your story"
        }
      }
    ]
  }
}

那么它在幕后做了什么?

So what it does behind the scenes?

  1. 它将您的文档作为整个关键字进行索引(它将整个字符串作为单个标记发出).我还添加了 asciifolding 过滤器,因此它标准化了字母,即 é 变为 e) 和小写过滤器(不区分大小写的搜索).例如,draft kings 被索引为 draft kings
  2. 现在搜索分析器使用相同的逻辑,除了它的分词器发出单词标记,并在此之上创建带状疱疹(标记的组合),它将匹配您在第一步中索引的关键字.
  1. It indexes your documents as whole keywords (It emits whole string as a single token). I've also added asciifolding filter, so it normalizes letters, i.e. é becomes e) and lowercase filter (case insensitive search). So for instance Draft kings is indexed as draft kings
  2. Now search analyzer is using same logic, except that its' tokenizer is emitting word tokens and on top of that creates shingles(combination of tokens), which will match your keywords indexed as in first step.

这篇关于使用弹性搜索从文本中提取关键字(多词)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆