Extend Endeca's diacritic folding mapping

Problem description

We have an index with mixed Greek and English data for an ATG-Endeca application. The indexed Greek data contains accented words. Search terms typed without accents either match nothing, or match only because autocorrection replaces the unaccented character with the accented one, which is not the behaviour we want. The Dgidx --diacritic-folding configuration does not include mappings for Greek characters (https://docs.oracle.com/cd/E29584_01/webhelp/mdex_basicDev/src/rbdv_chars_mapping.html).

Is it possible to extend this out-of-the-box (OOB) functionality through a properties file on the Endeca side, through Nucleus, or in code?

Recommended answer

The documentation you link to states:

Dgidx supports mapping Latin1, Latin extended-A, and Windows CP1252 international characters to their simple ASCII equivalents during indexing.

This suggests that Greek is not supported, since it doesn't fall into any of these character sets (Greek is covered by ISO 8859-7, also known as Latin/Greek, and by the Unicode Greek blocks). That said, there are two language settings you could try. If each language has its own record, you could set a language flag at the record level (you indicate that your data includes both English and Greek). Otherwise, you could set a global language using the dgidx and dgraph parameters, but this will affect things like stemming for records or properties that are not in the global language.

dgidx --lang el
dgraph --lang el

Given the documentation statement quoted above, though, I'm not sure this will work.

Alternatively, you can remove diacritics yourself with a custom Accessor that extends the atg.repository.search.indexing.PropertyAccessorImpl class (an option since you refer to Nucleus, so I assume you are using ATG/Oracle Commerce). With this you define a normalised searchable field in your index that duplicates each searchable field in your current index, but with all diacritics removed. The same logic you apply in the Accessor then needs to be applied as a preprocessor on your search terms, so that the input is normalised to match the indexed values. Lastly, make the original fields in the index (with the accented characters) display-only, and make the normalised fields searchable but not displayed.
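The normalisation step itself could look something like the sketch below. It is only an illustration under stated assumptions: DiacriticFolder and fold are hypothetical names, not part of ATG or Endeca, and the approach shown (Unicode NFD decomposition followed by removal of combining marks, using the standard java.text.Normalizer) is one common way to strip accents rather than anything the Endeca documentation prescribes. Wiring it into the custom Accessor and into the search-term preprocessor is ATG-specific and is left out here.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

/** Hypothetical helper (not an ATG or Endeca class) that strips diacritics from text. */
final class DiacriticFolder {

    // Matches any Unicode combining mark, e.g. the Greek tonos after decomposition.
    private static final Pattern COMBINING_MARKS = Pattern.compile("\\p{M}+");

    private DiacriticFolder() {
    }

    /** Returns the text with all combining marks removed, or null for null input. */
    static String fold(String text) {
        if (text == null) {
            return null;
        }
        // NFD splits each accented letter into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Drop the marks and keep only the base letters.
        String stripped = COMBINING_MARKS.matcher(decomposed).replaceAll("");
        // Recompose what is left so the result is in a stable form (NFC).
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Accented Greek and Latin input; the accents are removed in the output.
        System.out.println(fold("καλημέρα"));
        System.out.println(fold("café"));
    }
}
```

The important design point is that the same fold method is called in both places: at index time, when the Accessor produces the value for the normalised field, and at query time, before the search term is sent to the MDEX. If the two sides ever normalise differently, unaccented queries will stop matching again.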

Searches will then match against your normalised text. The downside is that the data is duplicated, so your index will be bigger, though this is not a big issue with small data sets. There may also be an impact on how OOTB functionality such as stemming behaves with the normalised data. You'll have to test various scenarios in Greek and English to see whether precision and recall are adversely affected.
