Multi-word Term Vectors with Word nGrams?
I'm aiming to build an index that, for each document, breaks the text down into word n-grams (unigrams, bigrams, and trigrams) and then captures term vector statistics for all of those n-grams. Is that possible with Elasticsearch?
For instance, for a document field containing "The red car drives.", I would like to get the following information:
red - 1 instance
car - 1 instance
drives - 1 instance
red car - 1 instance
car drives - 1 instance
red car drives - 1 instance
Thanks in advance!
Assuming you already know about the Term Vectors API, you could apply the shingle token filter at index time, which adds those word n-grams to the token stream as independent terms.
Set min_shingle_size to 1 (instead of the default of 2) and max_shingle_size to at least 3 (instead of the default of 2). If your Elasticsearch version rejects a minimum below 2, keep the default of 2 and rely on output_unigrams, which defaults to true, to emit the single words.
And since you left "the" out of the expected terms, you should apply a stop words filter before the shingle filter.
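To make the intended token stream concrete, here is a minimal plain-Python sketch of what the stop filter followed by the shingle filter is meant to produce. It does not call Elasticsearch: the tiny STOPWORDS set and the shingles helper are illustrative assumptions, and the sketch simply drops stop words without modelling token positions.

```python
import re
from collections import Counter

# Illustrative stop list only; Elasticsearch's _english_ list is longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to"}


def shingles(text, min_size=1, max_size=3):
    """Return word n-grams of min_size..max_size words, after stop word removal."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


# Count how often each n-gram occurs, similar in spirit to a term frequency.
counts = Counter(shingles("The red car drives."))
for term, n in counts.items():
    print(f"{term} - {n} instance")
```

For the example sentence this yields the six terms listed in the question, each with a count of 1.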
The analyzer settings would be something like this:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "evolutionAnalyzer": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "custom_stop",
            "custom_shingle"
          ]
        }
      },
      "filter": {
        "custom_stop": {
          "type": "stop",
          "stopwords": "_english_",
          "enable_position_increments": "false"
        },
        "custom_shingle": {
          "type": "shingle",
          "min_shingle_size": "1",
          "max_shingle_size": "3"
        }
      }
    }
  }
}
You can test the analyzer using the _analyze API endpoint.
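As a rough sketch of those calls, the Python below only builds the request method, path, and JSON body for an _analyze call and a term vectors call; it does not contact a cluster. The index name docs, the field name body, and the document id 1 are hypothetical, and the path layout assumes the typeless endpoints of recent Elasticsearch versions (older versions include the mapping type in the path).

```python
import json

# Hypothetical names used only for illustration.
INDEX = "docs"
FIELD = "body"
DOC_ID = "1"


def analyze_request(text):
    """Build an _analyze request that runs `text` through the custom analyzer."""
    body = {"analyzer": "evolutionAnalyzer", "text": text}
    return "GET", f"/{INDEX}/_analyze", body


def termvectors_request(doc_id, field=FIELD):
    """Build a term vectors request for one field of one indexed document."""
    body = {"fields": [field], "term_statistics": True}
    return "GET", f"/{INDEX}/_termvectors/{doc_id}", body


# Print both requests so they can be pasted into a REST client.
for method, path, body in (analyze_request("The red car drives."), termvectors_request(DOC_ID)):
    print(method, path)
    print(json.dumps(body, indent=2))
```

The _analyze response lists the tokens the analyzer emits, so you can check that the shingles look right before indexing; the term vectors response then reports the frequency of each of those terms for a stored document.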