ElasticSearch: Index vs type and handling updates


Problem description

I'm pretty familiar with the capabilities of ElasticSearch and its benefits, but this is the first time I'm getting my hands dirty building an index. So I'm eager to get the following approach vetted by experts.

Requirement:

Our application gets metadata about various products from multiple sources. Typically these feeds come in as XML files (file size can vary from 2 GB to 12 GB and sometimes a single record is spread across multiple files) and the information contained in the feed from one provider may or may not overlap with information contained in others. So we de-dupe this data, normalize this data to a set of common formats, depending on the product type, and we need to provide the ability to search against this consolidated data set (obviously this is where ElasticSearch comes in).

All products have certain common identifiers (like id, price, etc.), but the core metadata can look completely different between product types. To quantify this, let's say all products have 30% of their fields in common and the other 70% differ between product types. There aren't too many product types and it's safe to assume that there will not be more than 10 types at any point in time. To start with, the number is much smaller (around 3-4).

Additionally, there can be updates coming in at random intervals from these data sources and some of these updates need to be reflected in searches right away (near real time), without bringing down the search capability.

The proposed solution:

I'm considering having different indexes for different product types, based on what I read here: https://www.elastic.co/blog/index-vs-type. So the normalization job will look at the source files, create the normalized structure for the given product type and add it to the appropriate index. The search API that we expose will perform a search against each of these indices for the search term used and consolidate the results into a single JSON response with multiple sections (one section for each product type).
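The search-and-consolidate step described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the index names and the `title`/`description` fields are hypothetical, and the code only builds the request and regroups a response, it does not talk to a cluster. It relies on Elasticsearch accepting a comma-separated list of indices in the search path and on every hit carrying its `_index`.

```python
from collections import defaultdict

# Hypothetical index names, one per product type (not from the original post).
PRODUCT_INDICES = ["products-books", "products-movies", "products-games"]


def build_search_request(term, indices=PRODUCT_INDICES,
                         common_fields=("title", "description")):
    """Return (path, body) for a single search spanning all per-type indices."""
    path = "/" + ",".join(indices) + "/_search"
    body = {
        "query": {
            # Search only fields assumed to exist in every product type.
            "multi_match": {"query": term, "fields": list(common_fields)}
        }
    }
    return path, body


def group_hits_by_index(response):
    """Split a search response into one section per index (product type)."""
    sections = defaultdict(list)
    for hit in response["hits"]["hits"]:
        sections[hit["_index"]].append(hit["_source"])
    return dict(sections)
```

One design note: a single multi-index search ranks all hits together, so a product type with weaker matches may get few or no entries in its section. If each section needs its own top results, sending one query per index through the `_msearch` endpoint may be a better fit.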

For updates, we plan on using the bulk API for updates, inserts and deletes, and given the limitations around REST API calls (request size), we will have to make these calls in batches of x MB each.
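A rough sketch of that batching, assuming the typeless bulk format (action lines carrying only `_index` and `_id`; older versions also expect a `_type`). The helper names and the operation tuples are made up for illustration. Each yielded string is a newline-delimited JSON body that could be sent to the `_bulk` endpoint.

```python
import json


def bulk_lines(op, index, doc_id, doc=None):
    """Return the NDJSON line(s) for one bulk operation."""
    action = json.dumps({op: {"_index": index, "_id": doc_id}})
    if op == "delete":
        return [action]                            # delete has no source line
    if op == "update":
        return [action, json.dumps({"doc": doc})]  # partial document update
    return [action, json.dumps(doc)]               # "index" or "create"


def batch_bulk_payloads(operations, max_bytes):
    """Yield bulk request bodies of at most max_bytes each.

    A single operation larger than max_bytes is still sent, alone in its batch.
    """
    batch, size = [], 0
    for op in operations:
        chunk = "".join(line + "\n" for line in bulk_lines(*op))
        chunk_size = len(chunk.encode("utf-8"))
        if batch and size + chunk_size > max_bytes:
            yield "".join(batch)
            batch, size = [], 0
        batch.append(chunk)
        size += chunk_size
    if batch:
        yield "".join(batch)
```

Usage would look something like `for body in batch_bulk_payloads(ops, 5 * 1024 * 1024): ...`, where `ops` is an iterable of `(op, index, doc_id, doc)` tuples produced by the normalization job and the 5 MB limit is an arbitrary placeholder for "x MB".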

Question:

Is this the best way to organize the data and update it in ElasticSearch (for my use-case)? Would it be better to use multiple types on the same index (example: /products containing products of type typeA, typeB, etc.) instead of creating one index per product type? If so, will the search be significantly faster than searching across indices? Are there better ways to handle the CRUD of records after the index has been created?

Thanks in advance!

Solution

First, it is worth noting that mapping types are going away: indices created in ES 6 allow only a single type, ES 7 deprecates types in the APIs, and ES 8 removes them entirely.

Now, whether types go away or not, it is still possible to use a single index. However, you'd increase sparsity, since only 30% of your fields are common, and that should be avoided at all costs.
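To put a rough number on that sparsity, here is a back-of-the-envelope calculation. It assumes (this is not stated in the question) that every product type has the same number of fields and that the type-specific 70% never overlaps between types.

```python
def fill_ratio(num_types, common_share=0.3):
    """Fraction of a shared index's fields that one document populates.

    Assumes equal field counts per type and no overlap between the
    type-specific fields of different types.
    """
    # Field count of the shared index, in units of "fields per type":
    # the common part once, plus one type-specific part per type.
    total_fields = common_share + num_types * (1.0 - common_share)
    return 1.0 / total_fields


for n in (1, 4, 10):
    print(f"{n} types -> {fill_ratio(n):.0%} of fields populated per document")
```

Under these assumptions a document in a shared index fills about a third of the fields with four product types and about one in seven with ten, whereas with one index per type every document can fill all the fields of its own index.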

So, I'd say that your multi-index approach is the only one that makes sense given the nature of your data.

Additional information worth reading: https://www.elastic.co/guide/en/elasticsearch/reference/master/removal-of-types.html
