Multilingual Search using Lucene


Question

I am building a multilingual search, and I will use Lucene as the tool to do it.

I already have the translated content; each document will exist in 3 or 4 languages.

For indexing and search, there are four possible strategies for each document/content:

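  1. Each language is indexed in a separate index/directory.
  2. Each language is indexed in a separate document, but in the same index.
  3. Each language is indexed in a separate field, but in the same document.
  4. All languages are indexed in the same field of one document.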
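
To make the four layouts concrete, here is a small sketch in plain Python (not Lucene API) of how one translated item might be laid out under each strategy. The field names (`content_id`, `lang`, `body`, `body_en`, ...) and the sample texts are assumptions made up for illustration.

```python
# One piece of content, already translated (sample data, made up for illustration).
translations = {"en": "the quick fox", "de": "der schnelle fuchs", "fr": "le renard rapide"}
content_id = "doc-1"  # hypothetical identifier shared by all translations


def strategy_1(cid, texts):
    """One index per language: maps language -> list of documents in that index."""
    return {lang: [{"content_id": cid, "body": text}] for lang, text in texts.items()}


def strategy_2(cid, texts):
    """One index, one document per language, all sharing the same field name."""
    return [{"content_id": cid, "lang": lang, "body": text} for lang, text in texts.items()]


def strategy_3(cid, texts):
    """One index, one document, one field per language."""
    doc = {"content_id": cid}
    for lang, text in texts.items():
        doc["body_" + lang] = text
    return [doc]


def strategy_4(cid, texts):
    """One index, one document, all languages concatenated into one field."""
    return [{"content_id": cid, "body": " ".join(texts.values())}]
```

Strategies 1 to 3 keep the languages separable (by index, by a `lang` field, or by field name), while strategy 4 loses that information once the text is concatenated.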

I have not tested each approach yet. Could anyone with experience tell me which one is the better way to do multilingual search?

Thanks!

Recommended answer

Although the question was asked a couple of years ago, it is still a great question.

There are a few aspects to consider when evaluating the different approaches:

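  1. Are language-specific analyzers used at indexing time?
  2. Is the query language always known (e.g. user-selectable)?
  3. Does the query language always match one of the content languages?
  4. Should only content that matches the query language be returned?
  5. Is relevance important?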

If (1) and (5) apply to your project, you should not consider any strategy that (re-)uses the same field for multiple languages in the same inverted index, because the term frequencies of the various languages get mixed together (regardless of whether you index your multilingual content as one document or as multiple documents). It is worth knowing that adding "n" language-specific fields does not result in an "n"-times larger index, but, for obvious reasons, it does come with some overhead.
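
The effect of mixed term statistics can be sketched with a toy example in plain Python. This is not Lucene's scoring: the `idf` function below is a simplified log(N / df), and the corpus is made up for illustration. It only shows how a word such as "die" (rare in English, very common in German) gets a different weight depending on whether the languages share a field.

```python
import math

# Tiny made-up corpus: (language, text). "die" is a verb in English
# and a very common article in German.
corpus = [
    ("en", "the plants die without water"),
    ("en", "the fox runs"),
    ("en", "the water is cold"),
    ("de", "die katze trinkt"),
    ("de", "die sonne scheint"),
    ("de", "die blume ist rot"),
]


def idf(term, texts):
    """Simplified inverse document frequency: log(N / df); assumes df >= 1."""
    df = sum(1 for text in texts if term in text.split())
    return math.log(len(texts) / df)


# Shared field: statistics are computed over all languages together.
shared_field = [text for _, text in corpus]
# Language-specific field: statistics are computed over English text only.
english_field = [text for lang, text in corpus if lang == "en"]

idf_shared = idf("die", shared_field)    # df = 4 of 6 documents
idf_english = idf("die", english_field)  # df = 1 of 3 documents
```

In the shared field, "die" looks like a common, low-value term to an English query, even though it is rare in the English content; with a language-specific field its weight reflects English usage only.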


Single field (strategies 2 and 4)

+ only one field to query
+ scales well for additional languages
+ can distinguish/filter languages (if multiple documents, and extra language field)
- cannot distinguish/filter languages (if single document)
- cannot just display the queried language (if single document)
- "wrong" term frequencies (as all languages are mixed up)

Multiple fields (strategy 3)

+ correct term frequencies
+ can easily restrict/filter queries for particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- more fields to index
- more fields to query

Multiple indices (strategy 1)

+ correct term frequencies
+ can easily restrict/filter queries for particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- each additional language requires its own index

Regardless of whether you use a single field or multiple fields, if you index your content as multiple documents, your solution might need to handle result collapsing for matches in the "wrong" language. One approach could be to add a language field and filter on it.
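
A minimal sketch of both ideas, filtering on a language field and collapsing several translations of the same content into one result, is shown here in plain Python rather than Lucene API. The hit tuples, the `content_id` and `lang` names, and the scores are hypothetical.

```python
# Hypothetical search hits when every translation is indexed as its own document.
# Each hit: (content_id, lang, score); the values are made up for illustration.
hits = [
    ("doc-1", "de", 2.4),
    ("doc-1", "en", 1.9),
    ("doc-2", "en", 1.2),
    ("doc-3", "fr", 0.8),
]


def filter_by_language(hits, lang):
    """Keep only hits whose language field matches the query language."""
    return [hit for hit in hits if hit[1] == lang]


def collapse_by_content(hits, preferred_lang):
    """Keep one hit per content_id, preferring the query language, then higher score."""
    best = {}
    for cid, lang, score in hits:
        key = (lang == preferred_lang, score)
        if cid not in best or key > best[cid][0]:
            best[cid] = (key, (cid, lang, score))
    collapsed = [hit for _, hit in best.values()]
    return sorted(collapsed, key=lambda hit: hit[2], reverse=True)
```

Filtering drops content that has no match in the query language (here `doc-3`), whereas collapsing keeps it but shows each piece of content only once; which behaviour is wanted depends on aspect (4) above.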

Recommendation: The approach/strategy you choose depends on your project's requirements. Whenever possible, I would opt for a multiple-fields or multiple-indices approach.
