Extract keywords (multi-word) from text using Elasticsearch


Problem description



I have an index full of keywords and based on those keywords I want to extract the keywords from the input text.

The following is a sample of the keyword index. Note that a keyword can also consist of multiple words; essentially, each one is a unique tag.

{
  "hits": {
    "total": 2000,
    "hits": [
      {
        "id": 1,
        "keyword": "thousand eyes"
      },
      {
        "id": 2,
        "keyword": "facebook"
      },
      {
        "id": 3,
        "keyword": "superdoc"
      },
      {
        "id": 4,
        "keyword": "quora"
      },
      {
        "id": 5,
        "keyword": "your story"
      },
      {
        "id": 6,
        "keyword": "Surgery"
      },
      {
        "id": 7,
        "keyword": "lending club"
      },
      {
        "id": 8,
        "keyword": "ad roll"
      },
      {
        "id": 9,
        "keyword": "the honest company"
      },
      {
        "id": 10,
        "keyword": "Draft kings"
      }
    ]
  }
}

Now, if I input the text "I saw the news of lending club on facebook, your story and quora", the output of the search should be ["lending club", "facebook", "your story", "quora"]. The search should also be case-insensitive.

Solution

There's just one real way to do this. You'll have to index your data as keywords and search it analyzed with shingles.

See this reproduction:

First, we'll create two custom analyzers, one based on the keyword tokenizer and one that produces shingles:

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        },
        "my_analyzer_shingle": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "asciifolding",
            "lowercase",
            "shingle"
          ]
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "keyword": {
          "type": "string",
          "index_analyzer": "my_analyzer_keyword",
          "search_analyzer": "my_analyzer_shingle"
        }
      }
    }
  }
}
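
The request above uses the mapping syntax of the Elasticsearch release the answer was written for. As far as I know, later releases renamed `index_analyzer` to `analyzer`, replaced the analyzed `string` type with `text`, and dropped the mapping type level (`your_type` here). The following is only a sketch under those assumptions; `modernize_field` is a hypothetical helper, not part of any client library, and the result should be checked against the documentation for your version.

```python
import json

# Legacy field mapping copied from the request above.
legacy_field = {
    "type": "string",
    "index_analyzer": "my_analyzer_keyword",
    "search_analyzer": "my_analyzer_shingle",
}


def modernize_field(field):
    """Hypothetical helper: rename legacy mapping parameters (assumed renames)."""
    updated = dict(field)
    if updated.get("type") == "string":
        updated["type"] = "text"  # assumption: analyzed strings are now "text"
    if "index_analyzer" in updated:
        # assumption: "index_analyzer" is now spelled "analyzer"
        updated["analyzer"] = updated.pop("index_analyzer")
    return updated


# Assumption: typeless mappings, so "properties" sits directly under "mappings".
modern_mappings = {
    "mappings": {"properties": {"keyword": modernize_field(legacy_field)}}
}
print(json.dumps(modern_mappings, indent=2))
```

The `settings` block with the two analyzers would stay as written; only the field definition changes in this sketch.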

Now let's create some sample data using what you gave us:

POST /test/your_type/1
{
  "id": 1,
  "keyword": "thousand eyes"
}
POST /test/your_type/2
{
  "id": 2,
  "keyword": "facebook"
}
POST /test/your_type/3
{
  "id": 3,
  "keyword": "superdoc"
}
POST /test/your_type/4
{
  "id": 4,
  "keyword": "quora"
}
POST /test/your_type/5
{
  "id": 5,
  "keyword": "your story"
}
POST /test/your_type/6
{
  "id": 6,
  "keyword": "Surgery"
}
POST /test/your_type/7
{
  "id": 7,
  "keyword": "lending club"
}
POST /test/your_type/8
{
  "id": 8,
  "keyword": "ad roll"
}
POST /test/your_type/9
{
  "id": 9,
  "keyword": "the honest company"
}
POST /test/your_type/10
{
  "id": 10,
  "keyword": "Draft kings"
}

And finally, the query to run the search:

POST /test/your_type/_search
{
  "query": {
    "match": {
      "keyword": "I saw the news of lending club on facebook, your story and quora"
    }
  }
}

And this is the result:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0.009332742,
    "hits": [
      {
        "_index": "test",
        "_type": "your_type",
        "_id": "2",
        "_score": 0.009332742,
        "_source": {
          "id": 2,
          "keyword": "facebook"
        }
      },
      {
        "_index": "test",
        "_type": "your_type",
        "_id": "7",
        "_score": 0.009332742,
        "_source": {
          "id": 7,
          "keyword": "lending club"
        }
      },
      {
        "_index": "test",
        "_type": "your_type",
        "_id": "4",
        "_score": 0.009207102,
        "_source": {
          "id": 4,
          "keyword": "quora"
        }
      },
      {
        "_index": "test",
        "_type": "your_type",
        "_id": "5",
        "_score": 0.0014755741,
        "_source": {
          "id": 5,
          "keyword": "your story"
        }
      }
    ]
  }
}
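
The question asked for a plain list of keywords, so the response still has to be reduced to the `keyword` values. A minimal sketch in Python, assuming the response has already been parsed into a dictionary; the `response` literal below is a trimmed copy of the result above, and `keywords_from_response` is a hypothetical helper name:

```python
# Trimmed copy of the search response shown above (only the fields used here).
response = {
    "hits": {
        "total": 4,
        "hits": [
            {"_id": "2", "_score": 0.009332742, "_source": {"id": 2, "keyword": "facebook"}},
            {"_id": "7", "_score": 0.009332742, "_source": {"id": 7, "keyword": "lending club"}},
            {"_id": "4", "_score": 0.009207102, "_source": {"id": 4, "keyword": "quora"}},
            {"_id": "5", "_score": 0.0014755741, "_source": {"id": 5, "keyword": "your story"}},
        ],
    }
}


def keywords_from_response(response):
    """Collect the stored keyword of every hit, in the order returned (by score)."""
    return [hit["_source"]["keyword"] for hit in response["hits"]["hits"]]


extracted = keywords_from_response(response)
print(extracted)  # ['facebook', 'lending club', 'quora', 'your story']
```

A search returns only the first page of hits (10 by default), so a text that could match more keywords than that would need a larger `size` in the request.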

So what does it do behind the scenes?

  1. It indexes your documents as whole keywords (the keyword tokenizer emits the whole string as a single token). I've also added the asciifolding filter, which normalizes letters (e.g. é becomes e), and the lowercase filter (for case-insensitive search). So, for instance, Draft kings is indexed as draft kings.
  2. The search analyzer uses the same logic, except that its tokenizer emits word tokens, and on top of that it creates shingles (combinations of adjacent tokens), which will match your keywords as indexed in the first step.
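
The two steps can be imitated in plain Python to see why the match works. This is a rough simulation, not Elasticsearch's implementation: `normalize` only approximates asciifolding plus lowercase, a simple regex stands in for the standard tokenizer, and all function names are made up for illustration. It assumes the shingle filter's defaults of two-word shingles emitted alongside the single words.

```python
import re
import unicodedata
from typing import List

KEYWORDS = [
    "thousand eyes", "facebook", "superdoc", "quora", "your story",
    "Surgery", "lending club", "ad roll", "the honest company", "Draft kings",
]


def normalize(text: str) -> str:
    """Approximate asciifolding + lowercase: strip diacritics, then lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def keyword_analyzer(text: str) -> List[str]:
    """Index side: the whole string becomes a single normalized token."""
    return [normalize(text)]


def shingle_analyzer(text: str, max_shingle_size: int = 2) -> List[str]:
    """Search side: single words plus shingles of up to max_shingle_size words."""
    words = re.findall(r"\w+", normalize(text))
    tokens = []
    for start in range(len(words)):
        for size in range(1, max_shingle_size + 1):
            if start + size <= len(words):
                tokens.append(" ".join(words[start:start + size]))
    return tokens


def extract_keywords(text: str, keywords: List[str], max_shingle_size: int = 2) -> List[str]:
    """Return the keywords whose index token equals one of the search tokens."""
    indexed = {keyword_analyzer(k)[0]: k for k in keywords}
    found = []
    for token in shingle_analyzer(text, max_shingle_size):
        if token in indexed and indexed[token] not in found:
            found.append(indexed[token])
    return found


text = "I saw the news of lending club on facebook, your story and quora"
print(extract_keywords(text, KEYWORDS))
# ['lending club', 'facebook', 'your story', 'quora']
```

The sketch returns matches in text order, whereas Elasticsearch orders hits by score. It also shows a limit of two-word shingles: a three-word keyword such as "the honest company" is only found when `max_shingle_size` is raised to 3, which in the real analyzer would mean configuring a custom shingle filter with a larger `max_shingle_size`.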
