Multilingual Search using Lucene


Question

I am working on multilingual search, and I will use Lucene as the tool for it.

I already have the translated content; each document will exist in 3 or 4 languages.

For indexing and searching, there are four possible strategies for each document/content:

1. Each language is indexed in a different index/directory.
2. Each language is indexed in a different document, but in the same index.
3. Each language is indexed in a different field, but in the same document.
4. All languages are indexed in the same field of a document.

I have not tested any of these approaches yet. Could anyone with experience tell me which one works better for multilingual search?

Thanks!

Answer

Although the question was asked a couple of years ago, it is still a great question.

There are a few aspects to consider when evaluating the different solution approaches:

1. Are language-specific analyzers used at indexing time?
2. Is the query language always known (e.g. user-selectable)?
3. Does the query language always match one of the "content" languages?
4. Should only content matching the query language be returned?
5. Is relevancy important?

If (1.) and (5.) apply to your project, you should not consider any strategy that (re-)uses the same field for multiple languages in the same inverted index, because the term frequencies of the various languages all get mixed up (regardless of whether you index your multilingual content as one document or as multiple documents). It may be worth knowing that adding "n" language-specific fields does not result in an "n"-times larger index, although for obvious reasons it does come with some overhead.
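To make the mixed-up term frequencies concrete, here is a toy sketch in plain Java. It is not Lucene code: the `MixedFieldStats` class, the `Text` helper and the sample sentences are made up for illustration, and the tokenizer is far simpler than a real analyzer. It counts in how many texts the term "die" occurs, once across a shared field and once within the English texts only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy illustration (not Lucene code): document frequency in a shared vs. a per-language field. */
final class MixedFieldStats {

    /** Hypothetical helper: one translated text together with its language code. */
    static final class Text {
        final String lang;
        final String body;

        Text(String lang, String body) {
            this.lang = lang;
            this.body = body;
        }
    }

    /** Lowercases and splits on non-letters; a real analyzer would also stem and drop stop words. */
    static Set<String> terms(String body) {
        return new HashSet<>(Arrays.asList(body.toLowerCase().split("[^\\p{L}]+")));
    }

    /** Counts the texts containing the term; lang == null means "all languages" (shared field). */
    static long docFreq(List<Text> texts, String term, String lang) {
        return texts.stream()
                .filter(t -> lang == null || t.lang.equals(lang))
                .filter(t -> terms(t.body).contains(term))
                .count();
    }

    public static void main(String[] args) {
        List<Text> texts = List.of(
                new Text("en", "heroes never die"),
                new Text("en", "the search engine"),
                new Text("de", "die Suche im Index"),
                new Text("de", "die Sprache der Anfrage"));
        System.out.println("shared field:  " + docFreq(texts, "die", null)); // prints 3
        System.out.println("English field: " + docFreq(texts, "die", "en")); // prints 1
    }
}
```

In the shared field the English verb "die" looks like a very common term because the German article "die" is counted with it, so any scoring based on document frequency would treat it as less informative than it really is for an English query.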


Single Field (Strategies 2 & 4)

+ only one field to query
+ scales well to additional languages
+ can distinguish/filter languages (if multiple documents and an extra language field)
- cannot distinguish/filter languages (if single document)
- cannot display only the queried language (if single document)
- "wrong" term frequencies (as all languages are mixed up)
    

Multiple Fields (Strategy 3)

+ correct term frequencies
+ can easily restrict/filter queries to particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- more fields to index
- more fields to query
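
A minimal sketch of what the multiple-fields layout could look like, again in plain Java rather than the Lucene API. The `PerLanguageFields` class, the `content_<lang>` naming convention and the `id` field are assumptions made for illustration. The sketch also shows the "more fields to query" trade-off: with a known query language one field is searched, otherwise all of them.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Toy sketch (plain Java, not Lucene API): one document with one field per language. */
final class PerLanguageFields {

    /** Hypothetical naming convention: base field name plus language code, e.g. "content_de". */
    static String fieldFor(String baseField, String lang) {
        return baseField + "_" + lang;
    }

    /** Builds the field map of a single document from its translations (language code -> text). */
    static Map<String, String> toDocument(String id, Map<String, String> translations) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", id);
        translations.forEach((lang, text) -> doc.put(fieldFor("content", lang), text));
        return doc;
    }

    /** Fields to search: only the query language if it is known and indexed, otherwise all of them. */
    static List<String> fieldsToQuery(String queryLang, List<String> indexedLangs) {
        if (queryLang != null && indexedLangs.contains(queryLang)) {
            return List.of(fieldFor("content", queryLang));
        }
        return indexedLangs.stream()
                .map(lang -> fieldFor("content", lang))
                .collect(Collectors.toList());
    }
}
```

In an actual Lucene setup, each of these fields would typically be paired with its own language-specific analyzer, for example through `PerFieldAnalyzerWrapper`; how that is wired up depends on the Lucene version in use.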
    

Multiple Indices (Strategy 1)

+ correct term frequencies
+ can easily restrict/filter queries to particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- each additional language requires its own index
    

Regardless of whether you choose a single-field or a multiple-fields approach, your solution might need to handle result collapsing for matches in the "wrong" language if you index your content as multiple documents. One approach could be to add a language field and filter on it.
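
The language filter and the collapsing step can be sketched as follows. This is plain Java over an in-memory hit list, not Lucene API; the `LanguageCollapse` class and the `Hit` type with its `contentId`, `lang` and `score` members are hypothetical and only stand in for whatever your search layer returns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Toy sketch (plain Java, not Lucene API): language filter plus collapsing of per-language hits. */
final class LanguageCollapse {

    /** Hypothetical hit: one indexed document per (contentId, lang) pair, with a relevance score. */
    static final class Hit {
        final String contentId;
        final String lang;
        final double score;

        Hit(String contentId, String lang, double score) {
            this.contentId = contentId;
            this.lang = lang;
            this.score = score;
        }
    }

    /** Keeps only the hits whose language field equals the requested language. */
    static List<Hit> filterByLanguage(List<Hit> hits, String lang) {
        return hits.stream()
                .filter(h -> h.lang.equals(lang))
                .collect(Collectors.toList());
    }

    /** Keeps the best-scoring hit per contentId, in order of first appearance of each contentId. */
    static List<Hit> collapse(List<Hit> hits) {
        Map<String, Hit> best = new LinkedHashMap<>();
        for (Hit h : hits) {
            best.merge(h.contentId, h, (a, b) -> a.score >= b.score ? a : b);
        }
        return new ArrayList<>(best.values());
    }
}
```

Filtering is enough when only content in the query language should be returned; collapsing is the fallback when matches in any language are acceptable but each piece of content should appear only once.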

Recommendation: The approach/strategy you choose depends on your project's requirements. Whenever possible, I would opt for a multiple-fields or multiple-indices approach.
