Custom "tab" Tokenizer in ElasticSearch NEST 2.4


Problem Description


I have an index with many fields, and one field "ServiceCategories" has data similar to this:

|Case Management|Developmental Disabilities

I need to break up the data by the separator "|" and I have attempted to do so with this:

    var descriptor = new CreateIndexDescriptor(_DataSource.ToLower())
        .Mappings(ms => ms
            .Map<ProviderContent>(m => m
                .AutoMap()
                .Properties(p => p
                    .String(s => s
                        .Name(n => n.OrganizationName)
                        .Fields(f => f
                            .String(ss => ss.Name("raw").NotAnalyzed())))
                    .String(s => s
                        .Name(n => n.ServiceCategories)
                        .Analyzer("tab_delim_analyzer"))
                    .GeoPoint(g => g.Name(n => n.Location).LatLon(true)))))
        .Settings(st => st
            .Analysis(an => an
                .Analyzers(anz => anz
                    .Custom("tab_delim_analyzer", td => td
                        .Filters("lowercase")
                        .Tokenizer("tab_delim_tokenizer")))
                .Tokenizers(t => t
                    .Pattern("tab_delim_tokenizer", tdt => tdt
                        .Pattern("|")))));
    _elasticClientWrapper.CreateIndex(descriptor);

My search code for ServiceCategories (serviceCategories to ES) uses a simple TermQuery with the value set to lower case.

It's not getting results using this search parameter (the others work fine). Expected results are to get exact matches on at least one term from the above.
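For context, here is a rough Python model (not NEST code, and only an assumption about the intended behavior) of what the analysis chain is supposed to produce at index time. A term query is not analyzed, so the lowercased query value must exactly equal one of the indexed tokens:

```python
import re

def tab_delim_analyze(text):
    # Split on a literal '|' (what the pattern tokenizer is meant to do),
    # drop empty tokens, then apply the lowercase token filter.
    return [t.lower() for t in re.split(r"\|", text) if t]

print(tab_delim_analyze("|Case Management|Developmental Disabilities"))
# ['case management', 'developmental disabilities']
```

If the field were tokenized like this, the term query value "developmental disabilities" would match the second token exactly.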

I have attempted to get it working by using a classic tokenizer as well:

    var descriptor = new CreateIndexDescriptor(_DataSource.ToLower())
        .Mappings(ms => ms
            .Map<ProviderContent>(m => m
                .AutoMap()
                .Properties(p => p
                    .String(s => s
                        .Name(n => n.OrganizationName)
                        .Fields(f => f
                            .String(ss => ss.Name("raw").NotAnalyzed())))
                    .String(s => s
                        .Name(n => n.ServiceCategories)
                        .Analyzer("classic_tokenizer")
                        .SearchAnalyzer("standard"))
                    .GeoPoint(g => g.Name(n => n.Location).LatLon(true)))))
        .Settings(s => s
            .Analysis(an => an
                .Analyzers(a => a.Custom("classic_tokenizer", ca => ca
                    .Tokenizer("classic")))));

This isn't working either. Can anyone help me identify where I am going wrong?

Here's the search request:

### ES REQUEST ###
{
  "from": 0,
  "size": 10,
  "sort": [
    {
      "organizationName": {
        "order": "asc"
      }
    }
  ],
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {}
        },
        {
          "term": {
            "serviceCategories": {
              "value": "developmental disabilities"
            }
          }
        }
      ]
    }
  }
}

Solution

Your pattern for tab_delim_tokenizer is close, but not quite correct :) The easiest way to see this is to use the Analyze API to understand how an Analyzer will tokenize a piece of text. With your first mapping in place, we can check what the custom analyzer does

client.Analyze(a => a
    .Index(_DataSource.ToLower())
    .Analyzer("tab_delim_analyzer")
    .Text("|Case Management|Developmental Disabilities")
);

which returns (snipped for brevity)

{
  "tokens" : [ {
    "token" : "|",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "c",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "a",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "s",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "word",
    "position" : 3
  }, ... ]
}

demonstrating that the tab_delim_tokenizer is not tokenizing how we expect. A small change fixes this by escaping the | in the pattern with \ and making the pattern a verbatim string literal by prefixing with @.
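The same pitfall can be reproduced outside Elasticsearch: in a regex, an unescaped `|` is alternation between two empty patterns, which matches at every position. A quick sketch with Python's `re` module (the pattern tokenizer uses Java regexes, but the behavior here is analogous):

```python
import re

text = "|Case Management|Developmental Disabilities"

# Unescaped "|" is alternation of two empty patterns: it matches between
# every character, so the "tokens" come out one character at a time.
bad = [t for t in re.split(r"|", text) if t]
print(bad[:4])   # ['|', 'C', 'a', 's']

# Escaping the pipe makes it a literal delimiter.
good = [t for t in re.split(r"\|", text) if t]
print(good)      # ['Case Management', 'Developmental Disabilities']
```

This mirrors the Analyze API output above: with the unescaped pattern, Elasticsearch emits single-character tokens, exactly as the first response shows.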

Here's a complete example

void Main()
{
    var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
    var defaultIndex = "default-index";
    var connectionSettings = new ConnectionSettings(pool)
            .DefaultIndex(defaultIndex);

    var client = new ElasticClient(connectionSettings);

    if (client.IndexExists(defaultIndex).Exists)
        client.DeleteIndex(defaultIndex);

    var descriptor = new CreateIndexDescriptor(defaultIndex)
        .Mappings(ms => ms
            .Map<ProviderContent>(m => m
                .AutoMap()
                .Properties(p => p
                    .String(s => s
                        .Name(n => n.OrganizationName)
                        .Fields(f => f
                            .String(ss => ss.Name("raw").NotAnalyzed())))
                    .String(s => s
                        .Name(n => n.ServiceCategories)
                        .Analyzer("tab_delim_analyzer")
                    )
                    .GeoPoint(g => g
                        .Name(n => n.Location)
                        .LatLon(true)
                    )
                )
            )
        )
        .Settings(st => st
            .Analysis(an => an
                .Analyzers(anz => anz
                    .Custom("tab_delim_analyzer", td => td
                        .Filters("lowercase")
                        .Tokenizer("tab_delim_tokenizer")
                    )
                )
                .Tokenizers(t => t
                    .Pattern("tab_delim_tokenizer", tdt => tdt
                        .Pattern(@"\|")
                    )
                )
            )
        );

    client.CreateIndex(descriptor);

    // check our custom analyzer does what we think it should
    client.Analyze(a => a
        .Index(defaultIndex)
        .Analyzer("tab_delim_analyzer")
        .Text("|Case Management|Developmental Disabilities")
    );

    // index a document and make it immediately available for search
    client.Index(new ProviderContent
    {   
        OrganizationName = "Elastic",
        ServiceCategories = "|Case Management|Developmental Disabilities"
    }, i => i.Refresh());


    // search for our document. Use a term query in a bool filter clause
    // as we don't need scoring (probably)
    client.Search<ProviderContent>(s => s
        .From(0)
        .Size(10)
        .Sort(so => so
            .Ascending(f => f.OrganizationName)
        )
        .Query(q => +q
            .Term(f => f.ServiceCategories, "developmental disabilities")          
        )
    );

}

public class ProviderContent
{
    public string OrganizationName { get; set; }

    public string ServiceCategories { get; set; }

    public GeoLocation Location { get; set; }
}

the search results return

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : null,
    "hits" : [ {
      "_index" : "default-index",
      "_type" : "providercontent",
      "_id" : "AVqNNqlQpAW_5iHrnIDQ",
      "_score" : null,
      "_source" : {
        "organizationName" : "Elastic",
        "serviceCategories" : "|Case Management|Developmental Disabilities"
      },
      "sort" : [ "elastic" ]
    } ]
  }
}

