Get top 100 most used three-word phrases in all documents
I have about 15,000 scraped websites whose body text is stored in an Elasticsearch index. I need to get the top 100 most-used three-word phrases across all these texts:
Something like this:
Hello there sir: 203
Big bad pony: 92
First come first: 56
[...]
I'm new to this. I looked into term vectors but they appear to apply to single documents. So I feel it will be a combination of term vectors and aggregation with n-gram analysis of sorts. But I have no idea how to go about implementing this. Any pointers will be helpful.
My current mapping and settings:
{
  "mappings": {
    "items": {
      "properties": {
        "body": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store": true,
          "analyzer": "fulltext_analyzer"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}
What you're looking for are called Shingles. Shingles are like "word n-grams": serial combinations of more than one term in a string. (E.g. "We all live", "all live in", "live in a", "in a yellow", "a yellow submarine")
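Conceptually, shingling is just a sliding window of N consecutive tokens over the token stream. A minimal Python sketch of 3-word shingles (a local illustration only, not Elasticsearch itself):

```python
# Sketch of 3-word shingling: a sliding window of three consecutive tokens.
# Lowercasing + split() here loosely mimics a whitespace tokenizer with a
# lowercase filter; Elasticsearch does this server-side at index time.
def three_word_shingles(text):
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

print(three_word_shingles("We all live in a yellow submarine"))
# → ['we all live', 'all live in', 'live in a', 'in a yellow', 'a yellow submarine']
```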
Take a look here: https://www.elastic.co/blog/searching-with-shingles
Basically, you need a field with a shingle analyzer that produces only 3-term shingles:
The Elastic blog-post configuration, but with:

"filter_shingle": {
  "type": "shingle",
  "max_shingle_size": 3,
  "min_shingle_size": 3,
  "output_unigrams": "false"
}
Then, after applying the shingle analyzer to the field in question (as in the blog post) and reindexing your data, you should be able to issue a simple terms aggregation on your body field to see the top one hundred three-word phrases:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "three-word-phrases": {
      "terms": {
        "field": "body",
        "size": 100
      }
    }
  }
}
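To sanity-check what that terms aggregation reports, the same computation can be sketched in plain Python over a handful of documents (a local illustration with made-up sample texts, not the Elasticsearch API):

```python
from collections import Counter

# Mimics the whitespace tokenizer + lowercase filter + 3-word shingle filter.
def three_word_shingles(text):
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

# Hypothetical sample documents standing in for the indexed body texts.
docs = [
    "hello there sir how are you",
    "hello there sir good day",
]

counts = Counter()
for doc in docs:
    counts.update(three_word_shingles(doc))

# Equivalent of the terms aggregation with "size": 100.
for phrase, count in counts.most_common(100):
    print(phrase, count)
```

The shingle counts are what the terms aggregation tallies per bucket, since each shingle is indexed as a single term.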