Solr Multilingual Indexing with one field
Question
Our current production index is 1.5 TB across 3 shards. We currently use the following field type:
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CustomNGramFilterFactory" minGramSize="3" maxGramSize="30" preserveOriginal="true"/>
</analyzer>
</fieldType>
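To make the index/query asymmetry of text_ngram concrete, here is a minimal Python sketch of the two analyzer chains. It assumes that the custom CustomNGramFilterFactory behaves like Solr's stock NGramFilterFactory with preserveOriginal (its source is not shown here); the function names are illustrative only.

```python
def ngrams(token, min_n=3, max_n=30, preserve_original=True):
    """All substrings of token with length min_n..max_n (assumed filter behaviour)."""
    longest = min(max_n, len(token))
    grams = {
        token[i:i + n]
        for n in range(min_n, longest + 1)
        for i in range(len(token) - n + 1)
    }
    if preserve_original:
        # Keeps values that are shorter than min_n or longer than max_n searchable.
        grams.add(token)
    return grams


def index_terms(value):
    # KeywordTokenizer keeps the whole value as one token; LowerCaseFilter lowercases it.
    return ngrams(value.lower())


def query_term(query):
    # The query analyzer has no n-gram step: one lowercased token.
    return query.lower()


def matches(value, query):
    # A document matches when the single query term is one of its indexed terms.
    return query_term(query) in index_terms(value)
```

Under these assumptions, any substring query of 3 to 30 characters matches, while a shorter query matches only when it equals the whole field value. That may matter for Chinese and Japanese, where two-character queries are common.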
The text_ngram field type works well for our US and other English-language clients. We now have some new Chinese and Japanese clients, so I searched for the best approach to multilingual indexing:
http://www.basistech.com/indexing-strategies-for-multilingual-search-with-solr-and-rosette/
https://docs.lucidworks.com/display/lweug/Multilingual+Indexing+and+Search
Every approach seems to have its pros and cons. I then experimented (R&D) with a single-field approach, and here is my new field type:
<fieldType name="text_multi" class="solr.TextField" positionIncrementGap="100">
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>
</analyzer>
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>
<filter class="solr.CustomNGramFilterFactory" minGramSize="3" maxGramSize="30" preserveOriginal="true"/>
</analyzer>
</fieldType>
I kept the same tokenizer and changed only the filters. It works well with all existing searches and use cases for English documents, as well as the new use cases for Chinese and Japanese documents.
I have the following questions for the Solr experts:
- Is this a correct approach, or am I missing something?
- Can you give me an example where this new field type would cause a problem? A use case or scenario with an example would be very helpful.
- Could it cause problems in the future as clients with other languages come on board?
Please provide some guidance or suggest the best strategy.
Recommended answer
My field type is as follows:
<fieldType name="text_reference" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="50" side="front"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="50" side="back"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
I have not found any issue with it in any language. I have verified it with French, German, Chinese, Japanese, Arabic, Polish, Finnish, and others.
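As a rough illustration of what this index analyzer emits, the following Python sketch mimics the chain: one lowercased keyword token, then front edge n-grams, then back edge n-grams of each front gram. It assumes a Solr release in which EdgeNGramFilterFactory still accepts the side attribute (later releases may reject it); the helper names are hypothetical.

```python
def edge_ngrams(token, min_n, max_n, side):
    """Prefixes (side='front') or suffixes (side='back') of length min_n..max_n."""
    longest = min(max_n, len(token))
    if side == "front":
        return [token[:n] for n in range(min_n, longest + 1)]
    return [token[-n:] for n in range(min_n, longest + 1)]


def index_terms(value):
    # KeywordTokenizer + LowerCaseFilter: the whole value as one lowercased token.
    token = value.lower()
    terms = set()
    # The second filter runs on every token produced by the first one.
    for prefix in edge_ngrams(token, 2, 50, "front"):
        terms.update(edge_ngrams(prefix, 2, 50, "back"))
    return terms


def matches(value, query):
    # The query analyzer only lowercases, so the query is a single term.
    return query.lower() in index_terms(value)
```

In this sketch the two chained filters together yield every substring of 2 to 50 characters, which is why the matching is independent of the script; a single-character query would not match.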
The field type you are currently using should not have issues with any language either (though I have not run your fieldType through the Solr analysis tool).
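One way to run such a check without the admin UI is Solr's field analysis request handler. The sketch below only builds the request URL; the host, port, and collection name are placeholders, and it assumes the handler is exposed at /analysis/field, which depends on the Solr version and solrconfig.xml.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url, collection, field_type, index_value, query_value):
    """Build a field-analysis request URL for one field type (no request is sent)."""
    params = {
        "analysis.fieldtype": field_type,    # fieldType name from the schema
        "analysis.fieldvalue": index_value,  # text run through the index analyzer
        "analysis.query": query_value,       # text run through the query analyzer
        "analysis.showmatch": "true",        # flag index tokens that match the query
        "wt": "json",
    }
    return f"{base_url}/solr/{collection}/analysis/field?{urlencode(params)}"


# Placeholder host and collection; substitute your own.
url = field_analysis_url(
    "http://localhost:8983", "mycollection", "text_ngram", "Apache Solr", "solr"
)
```

Opening the resulting URL in a browser, or fetching it with any HTTP client, should list the tokens produced at each stage of the index and query analyzers.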
If you have found any issue with your current fieldType named "text_ngram", please share it; that would help me analyze it further.
Otherwise, I suggest you stay with the current one.
One more thing: if you change the field type, you will have to re-index your existing documents, because the schema changes.