Solr Multilingual Indexing with one field

Question

Our current production index is 1.5 TB in size, spread across 3 shards. We currently use the following field type:

<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CustomNGramFilterFactory" minGramSize="3" maxGramSize="30" preserveOriginal="true"/>
  </analyzer>
</fieldType>
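
To make the behaviour of this chain concrete, here is a minimal Python sketch of what the index and query analyzers do to a value. It is only an approximation: `CustomNGramFilterFactory` is a custom class that is not shown in the post, so the sketch assumes it behaves like a standard n-gram filter and that `preserveOriginal="true"` keeps the whole lowercased value as an extra term. The function names are made up for illustration.

```python
def ngram_index_terms(value, min_gram=3, max_gram=30, preserve_original=True):
    """Approximate the index-time chain: keyword token -> lowercase -> n-grams."""
    # KeywordTokenizer keeps the entire field value as a single token.
    token = value.lower()
    terms = set()
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            terms.add(token[start:start + size])
    if preserve_original:
        # Assumed meaning of preserveOriginal="true": also keep the full token.
        terms.add(token)
    return terms


def ngram_query_term(query):
    """Query-time chain: the whole query becomes one lowercased term."""
    return query.lower()


def ngram_matches(value, query):
    """A document matches when the single query term is one of its index terms."""
    return ngram_query_term(query) in ngram_index_terms(value)


if __name__ == "__main__":
    print(ngram_matches("Solr Multilingual Indexing", "lingual index"))  # True
    print(ngram_matches("Solr", "so"))  # False: shorter than minGramSize
    print(len(ngram_index_terms("abcdefghij")))  # 36
```

Under these assumptions, any substring of 3 to 30 characters matches regardless of script, while a query shorter than 3 characters matches only when it equals the whole field value. The term count also grows quickly with value length, which is worth keeping in mind for an index of this size.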

The above field type works well for our US and English-language clients. We now have some new Chinese and Japanese clients, so after googling

http://www.basistech.com/indexing-strategies-for-multilingual-search-with-solr-and-rosette/
https://docs.lucidworks.com/display/lweug/Multilingual+Indexing+and+Search

for the best approach to a multilingual index, it seems that every approach has its pros and cons. I then experimented (R&D) with a single-field approach, and here is my new field type:

<fieldType name="text_multi" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
    <filter class="solr.CustomNGramFilterFactory" minGramSize="3" maxGramSize="30" preserveOriginal="true"/>
  </analyzer>
</fieldType>

I have kept the same tokenizer and only changed the filters. It works well with all existing searches and use cases for English documents, as well as with the new use cases for Chinese and Japanese documents.
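
As a rough illustration of one of the added filters, the sketch below shows the kind of width folding that `CJKWidthFilterFactory` performs before lowercasing: full-width Latin letters become ordinary ASCII and half-width katakana become regular katakana. Python's NFKC normalization is used here only as a stand-in and folds more than the Solr filter does, so treat it as an approximation. `CJKBigramFilterFactory` is not modelled at all, and the function names are made up for illustration.

```python
import unicodedata


def fold_width(text):
    """Approximate CJK width folding.

    NFKC is an assumption used for illustration; it normalizes more
    characters than the Solr filter does.
    """
    return unicodedata.normalize("NFKC", text)


def normalize_value(text):
    """Width folding followed by lowercasing, in the same order as the chain."""
    return fold_width(text).lower()


if __name__ == "__main__":
    print(normalize_value("ＳＯＬＲ"))  # solr
    print(normalize_value("ｶﾀｶﾅ"))  # カタカナ
```

With this folding applied at both index and query time, a query typed in full-width characters can produce the same term as its half-width form.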

Now I have the following questions for the Solr experts/gurus:

  1. Is this a correct approach, or am I missing something?
  2. Can you give me an example where the new field type above would cause a problem? A use case or scenario with an example would be very helpful.
  3. Could this cause problems in the future as different clients come on board?

Please provide some guidance or a best strategy.

Recommended answer

My field type is as follows:

<fieldType name="text_reference" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="50" side="front"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="50" side="back"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
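
The effect of chaining a front and a back edge n-gram filter can be sketched in Python as follows. This is an approximation under two assumptions: the second filter is applied to every token emitted by the first, and tokens shorter than `minGramSize` produce no grams. The function names are made up for illustration.

```python
def edge_ngrams(token, min_gram, max_gram, side):
    """Return the edge n-grams of a token taken from the front or the back."""
    sizes = range(min_gram, min(max_gram, len(token)) + 1)
    if side == "front":
        return [token[:n] for n in sizes]
    return [token[-n:] for n in sizes]  # side == "back"


def edge_index_terms(value, min_gram=2, max_gram=50):
    """Approximate the index chain: keyword -> lowercase -> front -> back."""
    token = value.lower()
    terms = set()
    # Each prefix from the first filter is fed into the second filter.
    for prefix in edge_ngrams(token, min_gram, max_gram, "front"):
        terms.update(edge_ngrams(prefix, min_gram, max_gram, "back"))
    return terms


if __name__ == "__main__":
    print(sorted(edge_index_terms("Tokyo")))
    print("京都" in edge_index_terms("東京都"))  # True
```

Under these assumptions, taking suffixes of every prefix yields every substring of 2 to 50 characters from the first 50 characters of the value, so the chain behaves like a substring index that does not depend on the language of the text.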

I did not find any issue with it for any language. I have verified it with French, German, Chinese, Japanese, Arabic, Polish, Finnish, etc.

As far as I can tell, the one you are currently using should not have an issue with any language either (I have not analyzed your fieldType in the Solr analysis tool).

If you have found any issue with your current fieldType named "text_ngram", please share it; that would help me analyze it further.

Otherwise, I suggest you stay with the current one.

One more thing: if you change the field type, you have to plan for re-indexing the existing index, because the schema changes.
