Solr Multilingual Indexing with One Field

Views: 96 · Published: 2020/5/13 18:45:59 · Tags: indexing, solr, multilingual


Question

Our current production index is 1.5 TB across 3 shards. We currently use the following field type:

<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CustomNGramFilterFactory" minGramSize="3" maxGramSize="30" preserveOriginal="true"/>
  </analyzer>
</fieldType>
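To make the index-time behaviour of this field type concrete, here is a minimal Python sketch of the chain (keyword tokenizer, lowercase, n-grams of 3 to 30 characters, original kept). `CustomNGramFilterFactory` is a custom class whose exact behaviour is not shown here, so the sketch assumes it acts like a plain n-gram filter; `ngram_terms` is a hypothetical helper, not a Solr API.

```python
def ngram_terms(value, min_gram=3, max_gram=30):
    """Approximate the index analyzer: one lowercased token, then all n-grams.

    Assumption: the custom n-gram filter behaves like a standard n-gram
    filter and always keeps the original token (preserveOriginal="true").
    """
    token = value.lower()          # KeywordTokenizer + LowerCaseFilter
    terms = {token}                # preserveOriginal keeps the whole value
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            terms.add(token[start:start + size])
    return terms


terms = ngram_terms("Solr Index")
print(len(terms))        # number of distinct terms indexed for one short value
print("r i" in terms)    # a 3-character substring is searchable
print("so" in terms)     # a 2-character substring is below minGramSize
```

Under these assumptions a single 10-character value already produces 36 distinct terms, which is worth keeping in mind for an index of this size.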

The field type above works well for our US and other English-language clients. We now have some new Chinese and Japanese clients, so I searched for the best approach to multilingual indexing and found these:

http://www.basistech.com/indexing-strategies-for-multilingual-search-with-solr-and-rosette/
https://docs.lucidworks.com/display/lweug/Multilingual+Indexing+and+Search

Every approach seems to have its pros and cons. I then experimented (R&D) with a single-field approach, and here is my new field type:

<fieldType name="text_multi" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
    <filter class="solr.CustomNGramFilterFactory" minGramSize="3" maxGramSize="30" preserveOriginal="true"/>
  </analyzer>
</fieldType>

I kept the same tokenizer and only changed the filters. It works well for all existing searches and use cases on English documents, as well as for the new use cases on Chinese and Japanese documents.
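One scenario that may be worth testing with this field type is short CJK queries. The Python sketch below approximates the chain under several assumptions: Unicode NFKC normalization stands in for `CJKWidthFilterFactory` (it folds more than width variants), the custom n-gram filter is treated as a plain n-gram filter, and `CJKBigramFilterFactory` is treated as a pass-through, since as far as I know it only bigrams tokens typed as CJK by a tokenizer such as `StandardTokenizer`, which `KeywordTokenizerFactory` does not do. That last point is best confirmed on the Analysis screen of the Solr admin UI. The function names are hypothetical.

```python
import unicodedata


def analyze_query(text):
    """Query analyzer sketch: width folding (approximated by NFKC) + lowercase."""
    return unicodedata.normalize("NFKC", text).lower()


def analyze_index(text, min_gram=3, max_gram=30):
    """Index analyzer sketch: query chain plus n-grams, original token kept."""
    token = analyze_query(text)
    terms = {token}
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            terms.add(token[start:start + size])
    return terms


def matches(query, indexed_value):
    """A query matches when its single term is among the indexed terms."""
    return analyze_query(query) in analyze_index(indexed_value)


print(matches("渋谷区", "東京都渋谷区"))          # 3 characters: found
print(matches("東京", "東京都渋谷区"))            # 2 characters: below minGramSize
print(matches("ＳＯＬＲ", "Apache Solr 入門"))    # fullwidth Latin folded to "solr"
```

If these assumptions hold, two-character queries, which are very common in Chinese and Japanese, only match when they equal the whole field value, because `minGramSize` is 3.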

I have the following questions for the Solr experts:

Is this a correct approach, or am I missing something?
Can you give an example where this new field type would cause a problem? A use case or scenario with an example would be very helpful.
Could this cause problems in the future as clients with other languages come on board?

Please provide some guidance or suggest the best strategy.

Recommended answer

My field type is as follows:

<fieldType name="text_reference" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="50" side="front"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="50" side="back"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

I have not found any issue with it for any language. I have verified it with French, German, Chinese, Japanese, Arabic, Polish, Finnish, and others.
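As a rough illustration of why this field type is language-agnostic, the Python sketch below simulates the two chained edge n-gram filters. It assumes each filter processes every token emitted by the previous one, and a Solr version that still accepts the `side` attribute; `edge_ngrams` and `analyze_index` are hypothetical helpers, not Solr APIs.

```python
def edge_ngrams(token, min_gram, max_gram, side):
    """Prefixes (side="front") or suffixes (side="back") of the token."""
    grams = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        grams.append(token[:size] if side == "front" else token[-size:])
    return grams


def analyze_index(value):
    """Keyword token, lowercased, then front grams, then back grams of each."""
    token = value.lower()
    terms = []
    for prefix in edge_ngrams(token, 2, 50, "front"):
        terms.extend(edge_ngrams(prefix, 2, 50, "back"))
    return terms


print(analyze_index("Solr"))
print(analyze_index("東京都"))
```

Taking every suffix of every prefix yields every substring of 2 to 50 characters, so under these assumptions the chain works on raw characters in any script, and two-character queries can match.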

The field type you are currently using should not have issues with any language either, although I have not run your fieldType through the Solr analysis tool.

If you have found any issue with your current fieldType, "text_ngram", please share it; that would help me analyze this further.

Otherwise, I suggest you stay with the current one.

One more thing: if you change the field type, you will have to re-index the existing index, because the schema changes.
