Indexing multilingual words in Lucene

Question

I am trying to index, in Lucene, a field that may contain RDF literals in different languages. Most of the approaches I have seen so far are:

  • Use a single index, where each document has one field for each language it uses, or

  • Use M indexes, where M is the number of languages in the corpus.

Lucene 2.9+ has a feature called payloads that allows attributes to be attached to terms. Has anyone used this mechanism to store language information (or other attributes, such as datatypes)? How does its performance compare with the other two approaches? Any pointers to source code showing how it is done would help. Thanks.

Recommended answer

It depends.

  1. Do you want to allow something like "Search all English text for 'foo'"? If so, you will need one field per language.
  2. Or do you want "Search all text for 'foo' and show the user which language the match was found in"? If so, either payloads or separate fields will work.
  3. An alternative is to index all your text in one field and add a second field that stores the language of the document (assuming each document is in a single language). Your search would then look something like +text:foo +language:english.
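
To make options 1 and 3 concrete, here is a minimal sketch that only assembles query strings in Lucene's classic query parser syntax, where a leading + marks a required clause. It does not call Lucene itself. The class name, the method names, and the text_<code> field naming scheme are assumptions made for illustration; only the +text:foo +language:english form comes from the answer above.

```java
import java.util.Locale;

// Illustrative helper, not part of Lucene: it only builds query strings
// that could be handed to a query parser. No escaping of special
// characters is attempted, so terms are assumed to be plain words.
final class LanguageQueryStrings {

    // Option 1: one field per language. The "text_<code>" naming scheme
    // is a hypothetical convention chosen for this sketch.
    static String perLanguageField(String term, String languageCode) {
        return "text_" + languageCode.toLowerCase(Locale.ROOT) + ":" + term;
    }

    // Option 3: one shared text field plus a required language field.
    static String sharedFieldWithLanguage(String term, String language) {
        return "+text:" + term + " +language:" + language.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(perLanguageField("foo", "en"));
        System.out.println(sharedFieldWithLanguage("foo", "English"));
    }
}
```

With option 1 the language restriction is implied by the field name, so the analyzer for that field can be language-specific. With option 3 the restriction is an extra required clause on a separate field.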

In terms of efficiency: you probably want to avoid payloads, since you would have to repeat the language name for every term, and you cannot search based on payloads (at least not easily).
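
The repetition argument can be illustrated with a rough back-of-the-envelope count. The sketch below is a deliberately naive model, not a measurement: it counts raw, uncompressed language-tag bytes and ignores how Lucene actually encodes postings and payloads. The class, the method names, and the corpus figures are all hypothetical.

```java
// Naive model of how often a language tag is recorded under each approach.
// Real index sizes depend on Lucene's encoding and will differ.
final class PayloadOverheadSketch {

    // Payload approach: the tag is attached to every token occurrence.
    static long payloadTagBytes(long tokenCount, int tagBytes) {
        return tokenCount * tagBytes;
    }

    // Language-field approach: the tag is recorded once per document.
    static long languageFieldTagBytes(long documentCount, int tagBytes) {
        return documentCount * tagBytes;
    }

    public static void main(String[] args) {
        long documents = 1_000;        // assumed corpus size
        long tokensPerDocument = 200;  // assumed average document length
        int tagBytes = 2;              // e.g. a two-letter code such as "en"

        long viaPayloads = payloadTagBytes(documents * tokensPerDocument, tagBytes);
        long viaField = languageFieldTagBytes(documents, tagBytes);

        System.out.println("payloads: " + viaPayloads + " bytes");      // 400000
        System.out.println("language field: " + viaField + " bytes");  // 2000
    }
}
```

Under these assumptions the payload approach records the tag once per token rather than once per document, which is the overhead the answer warns about.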
