How to store multiple distinct types of documents in Lucene

Question

I have an existing Lucene store with many millions of documents, each one representing metadata for an entity. I have a few Id fields (Id1, Id2 .. Id5), and each document can have zero or more values for each of these fields. The index is only ever queried by one of these Ids at a time. I've indexed these fields independently and it is all working great. I initially chose Lucene because it was by far the fastest way to query such a vast number of small documents, and I am happy with my decision.

However, I must now store another type of document, which represents a different kind of metadata for entities. These documents also have values for (Id1, Id2 .. Id5) and will also be queried by one of those Ids at a time. The existing metadata and this new set of data will be stored and queried independently of each other.

How do I query Lucene by an Id, but for only one type of document? I can think of a few options, but I'd like to know what those in the know recommend from experience, in order to keep Lucene manageable and fast.

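  1. Use a separate Lucene index. Since the document types are orthogonal, this sidesteps the problem. It has the added benefit that the indexes can be read from and written to separately.
  2. Rename the new documents' Id1..Idn fields to XId1 ... XIdn. That way, one type of document never shares a field name with the other type. This seems more like a workaround that avoids the problem than an actual solution.
  3. Add a numeric field "type" and change the indexes to (type, Idx). This approach seems wasteful, since every index would also have to include the type.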

I am able to break backwards compatibility with my existing setup. It would be great if the solution could be reused if I come to add another document type.

Recommended answer

I would definitely reject the third option because of the low selectivity of the type index. There will be only 2 distinct values in the type field, each one matching millions of documents. Lucene will need to merge this huge posting list with the short posting list from the IdN index, which can still be very fast, but is indeed wasteful.
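
The selectivity argument can be illustrated with a toy inverted index. This is a minimal sketch in plain Python, not the Lucene API: `ToyIndex`, `intersect`, the entity values `e0..e9999`, and the scaled-down count of 10,000 documents per type are all assumptions made for illustration. The binary search only stands in for whatever skipping a real engine does when it advances through a long posting list.

```python
from bisect import bisect_left
from collections import defaultdict


class ToyIndex:
    """Toy inverted index: (field, value) -> sorted list of doc ids."""

    def __init__(self):
        self.postings = defaultdict(list)
        self.next_doc_id = 0

    def add(self, fields):
        """Index one document given as {field: [values]}; return its doc id."""
        doc_id = self.next_doc_id
        self.next_doc_id += 1
        for field, values in fields.items():
            for value in values:
                # Doc ids are assigned in increasing order, so each list stays sorted.
                self.postings[(field, value)].append(doc_id)
        return doc_id

    def lookup(self, field, value):
        return self.postings.get((field, value), [])


def intersect(short, long):
    """Return the doc ids of `short` that also occur in the sorted list `long`."""
    hits = []
    for doc_id in short:
        i = bisect_left(long, doc_id)
        if i < len(long) and long[i] == doc_id:
            hits.append(doc_id)
    return hits


# Option 3 (hypothetical data): every document carries a "type" value.
index = ToyIndex()
N = 10_000  # stand-in for "millions"
for i in range(N):
    index.add({"type": [0], "Id1": [f"e{i}"]})
for i in range(N):
    index.add({"type": [1], "Id1": [f"e{i}"]})

type_postings = index.lookup("type", 1)   # one entry per document of that type
id_postings = index.lookup("Id1", "e42")  # one entry per type in this data set
hits = intersect(id_postings, type_postings)

print(len(type_postings), len(id_postings), hits)
```

In this model the type posting list grows with the size of the whole collection, while the Id posting list stays tiny, so the type clause adds index size and query work without narrowing the result much beyond what the Id term already does.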

The first two options are effectively the same at query time, because you have different terms and posting lists for the independent types of documents. The difference is in the indexing phase. Managing several independent indexes requires a bit more coordination and makes the code a little more complex. Yet it may be a good idea if you plan to use the indexes in different contexts. For example:

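  • geographic location;
  • backup strategy;
  • availability requirements;
  • indexing latency requirements (the time from a document change until it is visible to clients in the index).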

Otherwise, I would go with the first option, as it is simpler and more manageable.
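
The point that options 1 and 2 behave the same at query time can be sketched with a toy inverted index built from plain Python dictionaries. This is not the Lucene API: `add_all`, the sample documents, and the `X` prefix applied in code are illustrative assumptions that follow the naming used in the question.

```python
from collections import defaultdict


def add_all(postings, docs, first_doc_id=0, prefix=""):
    """Add docs ({field: [values]}) to a toy index keyed by (field, value).

    `prefix` renames every field, as in option 2. Returns the next free doc id.
    """
    for offset, fields in enumerate(docs):
        for field, values in fields.items():
            for value in values:
                postings[(prefix + field, value)].append(first_doc_id + offset)
    return first_doc_id + len(docs)


# Hypothetical sample data for the two document types.
existing = [{"Id1": ["a", "b"], "Id2": ["x"]}, {"Id1": ["c"]}]
new_kind = [{"Id1": ["a"]}, {"Id1": ["c"], "Id2": ["x"]}]

# Option 1: two independent indexes that share the same field names.
index_existing = defaultdict(list)
index_new = defaultdict(list)
add_all(index_existing, existing)
add_all(index_new, new_kind)

# Option 2: one index; the new type's fields are renamed with an "X" prefix.
combined = defaultdict(list)
next_id = add_all(combined, existing)
add_all(combined, new_kind, first_doc_id=next_id, prefix="X")

# Either way, a query by Id reads exactly one short posting list.
opt1_existing = index_existing.get(("Id1", "a"), [])
opt1_new = index_new.get(("Id1", "a"), [])
opt2_existing = combined.get(("Id1", "a"), [])
opt2_new = combined.get(("XId1", "a"), [])

print(opt1_existing, opt1_new, opt2_existing, opt2_new)
```

In both layouts the lookup touches a single term whose posting list contains only documents of the wanted type, so no intersection with a type term is needed. What differs is the bookkeeping: option 1 has to pick the right index, option 2 has to pick the right field name.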
