Get top 100 most used three-word phrases in all documents


Question



I have about 15,000 scraped websites with their body text stored in an Elasticsearch index. I need to get the top 100 most used three-word phrases across all these texts:

Something like this:

Hello there sir: 203
Big bad pony: 92
First come first: 56
[...]

I'm new to this. I looked into term vectors, but they appear to apply to single documents, so I suspect the answer is some combination of term vectors and aggregations with n-gram analysis. But I have no idea how to go about implementing this. Any pointers would be helpful.

My current mapping and settings:

{
  "mappings": {
    "items": {
      "properties": {
        "body": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "analyzer" : "fulltext_analyzer"
         }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}

Solution

What you're looking for are called Shingles. Shingles are like "word n-grams": serial combinations of more than one term in a string. (E.g. "We all live", "all live in", "live in a", "in a yellow", "a yellow submarine")
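
You can see what a shingle filter produces by testing it with the _analyze API. A minimal sketch, assuming Elasticsearch 5.x or later (where _analyze accepts inline filter definitions in the request body):

POST /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    {
      "type": "shingle",
      "min_shingle_size": 3,
      "max_shingle_size": 3,
      "output_unigrams": false
    }
  ],
  "text": "We all live in a yellow submarine"
}

The returned tokens should be exactly the five three-word phrases listed above.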

Take a look here: https://www.elastic.co/blog/searching-with-shingles

Basically, you need a field with a shingle analyzer producing solely 3-term shingles:

Use the configuration from the Elastic blog post, but with:

"filter_shingle":{
   "type":"shingle",
   "max_shingle_size":3,
   "min_shingle_size":3,
   "output_unigrams":"false"
}
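
For context, here is roughly how that filter could be wired into a complete settings/mappings body, combining the blog post's pattern with the setup from the question. This is a sketch, not the blog post's exact configuration: the analyzer name shingle_analyzer is illustrative, and "type": "string" matches the pre-5.x mapping in the question (on 5.x+ it would be "text" with fielddata enabled):

{
  "settings": {
    "analysis": {
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 3,
          "min_shingle_size": 3,
          "output_unigrams": false
        }
      },
      "analyzer": {
        "shingle_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "filter_shingle"
          ]
        }
      }
    }
  },
  "mappings": {
    "items": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "shingle_analyzer"
        }
      }
    }
  }
}

In practice you may prefer a multi-field, so that body keeps its original analyzer for normal search while a sub-field carries the shingles, as the blog post does.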

Then, after applying the shingle analyzer to the field in question (as in the blog post) and reindexing your data, you should be able to issue a query with a simple terms aggregation on your body field to see the top one hundred three-word phrases:

{
  "size" : 0,
  "query" : {
    "match_all" : {}
  },
  "aggs" : {
    "three-word-phrases" : {
      "terms" : {
        "field" : "body",
        "size"  : 100  
      }
    }
  }
}
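
The response should then contain the familiar terms-aggregation bucket list, something like this (shape only; the values are illustrative, borrowed from the question's example):

{
  ...
  "aggregations": {
    "three-word-phrases": {
      "buckets": [
        { "key": "hello there sir", "doc_count": 203 },
        { "key": "big bad pony", "doc_count": 92 },
        ...
      ]
    }
  }
}

Note that doc_count in a terms aggregation is the number of documents containing each shingle, not its total number of occurrences across the corpus.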
