Multi-word Term Vectors with Word nGrams?


Question



I'm aiming to build an index that, for each document, will break it down by word ngrams (uni, bi, and tri), then capture term vector analysis on all of those word ngrams. Is that possible with Elasticsearch?

For instance, for a document field containing "The red car drives.", I would be able to get the following information:

red - 1 instance
car - 1 instance
drives - 1 instance
red car - 1 instance
car drives - 1 instance
red car drives - 1 instance

Thanks in advance!

Solution

Assuming you already know about the Term Vectors API, you could apply the shingle token filter at index time to add those word n-grams as independent terms in the token stream.
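For reference, here is a minimal sketch of a term vectors request (the index name my_index, document id 1, and field name content are assumptions for illustration; older Elasticsearch versions also include a type name in the path):

GET /my_index/_termvectors/1
{
  "fields": ["content"],
  "term_statistics": true,
  "positions": true,
  "offsets": true
}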

Set min_shingle_size to 1 (instead of the default of 2) and max_shingle_size to at least 3 (instead of the default of 2), so that unigrams, bigrams, and trigrams are all emitted.

And since you left "the" out of the expected terms, you should apply a stop words filter before the shingle filter.

The analyzer settings would be something like this:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "evolutionAnalyzer": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "custom_stop",
            "custom_shingle"
          ]
        }
      },
      "filter": {
        "custom_stop": {
            "type": "stop",
            "stopwords": "_english_",
            "enable_position_increments":"false"
        },
        "custom_shingle": {
            "type": "shingle",
            "min_shingle_size": "1",
            "max_shingle_size": "3"
        }
      }
    }
  }
}
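Putting it together, here is a hedged sketch of creating an index that uses this analyzer on a field with term vectors stored (the index name my_index and field name content are illustrative assumptions; the deprecated standard token filter and enable_position_increments option are left out, and on older Elasticsearch versions the mapping would be nested under a type name):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "evolutionAnalyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "custom_stop", "custom_shingle"]
        }
      },
      "filter": {
        "custom_stop": { "type": "stop", "stopwords": "_english_" },
        "custom_shingle": { "type": "shingle", "min_shingle_size": "1", "max_shingle_size": "3" }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "evolutionAnalyzer",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}

With term_vector enabled on the field, the term vectors response will list each unigram, bigram, and trigram with its own frequency, as in the example above.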

You can test the analyzer using the _analyze API endpoint.
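For example, a quick check against the sample sentence (again assuming the index is named my_index; older Elasticsearch versions pass analyzer and text as query-string parameters instead of a JSON body):

GET /my_index/_analyze
{
  "analyzer": "evolutionAnalyzer",
  "text": "The red car drives."
}

Each word n-gram ("red", "red car", "red car drives", and so on) should come back as its own token in the response.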

