Get top 100 most used three-word phrases in all documents


Question



I have about 15,000 scraped websites with their body text stored in an Elasticsearch index. I need to get the top 100 most used three-word phrases across all these texts:

Something like this:

Hello there sir: 203
Big bad pony: 92
First come first: 56
[...]

I'm new to this. I looked into term vectors, but they appear to apply to single documents, so I suspect the answer is some combination of term vectors and aggregations with n-gram analysis. But I have no idea how to go about implementing this. Any pointers would be helpful.

My current mapping and settings:

{
  "mappings": {
    "items": {
      "properties": {
        "body": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "analyzer" : "fulltext_analyzer"
         }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}

Solution

What you're looking for are called Shingles. Shingles are like "word n-grams": serial combinations of more than one term in a string. (E.g. "We all live", "all live in", "live in a", "in a yellow", "a yellow submarine")
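
You can see what a shingle filter produces by testing it with the _analyze API. A minimal sketch, assuming Elasticsearch 5.x or later (where _analyze accepts inline filter definitions in the request body):

POST /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    {
      "type": "shingle",
      "min_shingle_size": 3,
      "max_shingle_size": 3,
      "output_unigrams": false
    }
  ],
  "text": "We all live in a yellow submarine"
}

The returned tokens should be exactly the five three-word phrases listed above.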

Take a look here: https://www.elastic.co/blog/searching-with-shingles

Basically, you need a field with a shingle analyzer producing solely 3-term shingles:

Use the configuration from the Elastic blog post, but with:

"filter_shingle":{
   "type":"shingle",
   "max_shingle_size":3,
   "min_shingle_size":3,
   "output_unigrams":"false"
}
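
For context, here is roughly how that filter could be wired into a complete settings/mappings body, combining the blog post's pattern with the setup from the question. This is a sketch, not the blog post's exact configuration: the analyzer name shingle_analyzer is illustrative, and "type": "string" matches the pre-5.x mapping in the question (on 5.x+ it would be "text" with fielddata enabled):

{
  "settings": {
    "analysis": {
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 3,
          "min_shingle_size": 3,
          "output_unigrams": false
        }
      },
      "analyzer": {
        "shingle_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "filter_shingle"
          ]
        }
      }
    }
  },
  "mappings": {
    "items": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "shingle_analyzer"
        }
      }
    }
  }
}

In practice you may prefer a multi-field, so that body keeps its original analyzer for normal search while a sub-field carries the shingles, as the blog post does.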

Then, after applying the shingle analyzer to the field in question (as in the blog post) and reindexing your data, you should be able to issue a query with a simple terms aggregation on your body field to see the top one hundred three-word phrases:

{
  "size" : 0,
  "query" : {
    "match_all" : {}
  },
  "aggs" : {
    "three-word-phrases" : {
      "terms" : {
        "field" : "body",
        "size"  : 100  
      }
    }
  }
}
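
The response should then contain the familiar terms-aggregation bucket list, something like this (shape only; the values are illustrative, borrowed from the question's example):

{
  ...
  "aggregations": {
    "three-word-phrases": {
      "buckets": [
        { "key": "hello there sir", "doc_count": 203 },
        { "key": "big bad pony", "doc_count": 92 },
        ...
      ]
    }
  }
}

Note that doc_count in a terms aggregation is the number of documents containing each shingle, not its total number of occurrences across the corpus.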
