Multiple indexes or multiple mapping types for sparse documents?


Question

I have ~10 different document types that share 10-15 common fields. Each document type also has additional fields of its own; for 3 of them, that is up to 30-40 additional fields.

I was considering using a different mapping type for each document type. But if I understand correctly how mappings work, ElasticSearch will internally use one mapping with 150-200 fields. Because no document has a value for every field, I will end up with a lot of sparse data.

According to this article (Index vs. Type), ElasticSearch is (was?) not very good at dealing with sparse data, so that would be an argument for having a separate index for each document type. But some document types have very few documents, so a separate index for them would be overkill.

My question: How bad are sparse documents? Or am I better off with a separate index for each type even though some indexes will only contain a few documents?

Solution

"ElasticSearch will internally use one mapping with 150-200 fields. Because no document has a value for each field, I will end up with a lot of sparse data."

Yes, different types within an index share the same mapping structure. Each document simply carries a "_type" field, which is automatically used for filtering when searching on a specific type.

"How bad are sparse documents?"

Quoting from "Index vs. Type":

"Fields that exist in one type will also consume resources for documents of types where this field does not exist. This is a general issue with Lucene indices: they don’t like sparsity."
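
To get a feel for how sparse a shared mapping would be, a rough sketch along the following lines may help. The type names and field counts are made up to resemble the question (10 types, 12 common fields, three types with 30-40 extra fields); they are assumptions, not measurements, and the fill ratio is only a crude proxy for the overhead Lucene actually incurs.

```python
# Hypothetical field layout resembling the question; all names are illustrative.
common = {f"common_{i}" for i in range(12)}

def make_type(name, extra_count):
    """Return the field set for one document type: common fields plus its own extras."""
    return common | {f"{name}_extra_{i}" for i in range(extra_count)}

# Three "wide" types with 30-40 extra fields, seven "narrow" types with 10 each.
extras = {"type_a": 40, "type_b": 35, "type_c": 30}
extras.update({f"type_{c}": 10 for c in "defghij"})
types = {name: make_type(name, n) for name, n in extras.items()}

# Types in one index share a single mapping: the union of all their fields.
shared_mapping = set().union(*types.values())

def fill_ratio(fields):
    """Fraction of the shared mapping's fields that a document of this type populates."""
    return len(fields) / len(shared_mapping)

print(f"shared mapping has {len(shared_mapping)} fields")
for name, fields in sorted(types.items()):
    print(f"{name}: {len(fields)} fields, fill ratio {fill_ratio(fields):.0%}")
```

With these assumed numbers, even the widest type fills well under half of the shared mapping, which is the kind of sparsity the quoted article warns about.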

"Am I better off with a separate index for each type even though some indexes will only contain a few documents?"

As you may be aware, each separate index has its own overhead, and types don't work well with sparse documents.

I would suggest

  • Document types with a small number of documents (and a large number of sparse fields) should go to a separate index, with the number of shards reduced to the minimum, i.e. 1. Each index has 5 shards by default (in Elasticsearch versions before 7.0). If your document count is not that large, using 5 shards makes no sense, and a single shard reduces the load of each search query.
  • Document types that have significant fields in common should go to the same index as different types. Depending on the total number of documents, you may want to increase the number of shards.
  • If some document types have a huge number of documents, you may want to create separate indices for them.
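
The three suggestions above could be turned into a small planning helper like the sketch below. The thresholds (`SMALL_DOCS`, `HUGE_DOCS`, `MANY_EXTRA_FIELDS`), the index names, and the example inventory are invented for illustration; only the `number_of_shards` setting is Elasticsearch's own name. The helper just builds the request bodies and never contacts a cluster.

```python
import json

# Assumed thresholds, invented for this sketch; tune them for real data.
SMALL_DOCS = 10_000        # "small number of documents"
HUGE_DOCS = 50_000_000     # "huge number of documents"
MANY_EXTRA_FIELDS = 30     # "large number of sparse fields"

def plan_indices(doc_types, shared_shards=5, huge_shards=5):
    """Return (plan, shared_types).

    doc_types maps a type name to (expected_doc_count, extra_field_count).
    plan maps an index name to a create-index request body.
    """
    plan, shared_types = {}, []
    for name, (docs, extra_fields) in sorted(doc_types.items()):
        if docs < SMALL_DOCS and extra_fields >= MANY_EXTRA_FIELDS:
            # Few documents, many sparse fields: own index with a single shard.
            plan[f"{name}_index"] = {"settings": {"number_of_shards": 1}}
        elif docs > HUGE_DOCS:
            # Very large type: own index, sized independently.
            plan[f"{name}_index"] = {"settings": {"number_of_shards": huge_shards}}
        else:
            # Mostly common fields: keep in the shared index.
            shared_types.append(name)
    if shared_types:
        plan["shared_index"] = {"settings": {"number_of_shards": shared_shards}}
    return plan, shared_types

# Hypothetical inventory: name -> (expected documents, extra fields).
example = {
    "invoice": (2_000, 40),
    "contract": (500_000, 5),
    "event_log": (200_000_000, 8),
}
plan, shared = plan_indices(example)
print(json.dumps(plan, indent=2))
print("types sharing one index:", shared)
```

Each body in `plan` could then be sent as the payload of a create-index request (`PUT /<index name>`). The cut-off values are a judgment call, so treat them as starting points rather than recommendations.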

Keep in mind that you should keep a reasonable number of shards in your cluster, which can be achieved by reducing the number of shards for indices that don’t require a high write throughput and/or will store low numbers of documents.
