ElasticSearch: Index vs type and handling updates


Problem description

I'm pretty familiar with the capabilities of ElasticSearch and its benefits, but this is the first time I'm getting my hands dirty building an index. So I'm eager to get the following approach vetted by experts.

Requirement:

Our application gets metadata about various products from multiple sources. Typically these feeds come in as XML files (file size can vary from 2 GB to 12 GB and sometimes a single record is spread across multiple files) and the information contained in the feed from one provider may or may not overlap with information contained in others. So we de-dupe this data, normalize this data to a set of common formats, depending on the product type, and we need to provide the ability to search against this consolidated data set (obviously this is where ElasticSearch comes in).

All products have certain common identifiers (like id, price, etc.), but the core metadata can look completely different between product types. To quantify this, let's say all products have 30% of their fields in common and 70% of their fields differ between product types. There aren't too many product types, and it's safe to assume that there will not be more than 10 types at any point in time. To start with, the number is much smaller (around 3-4).

Additionally, there can be updates coming in at random intervals from these data sources and some of these updates need to be reflected in searches right away (near real time), without bringing down the search capability.

The proposed solution:

I'm considering having different indexes for different product types, based on what I read here: https://www.elastic.co/blog/index-vs-type. So the normalization job will look at the source files, create the normalized structure for the given product type and add it to the appropriate index. The search API that we expose will perform a search against each of these indices for the search term used and consolidate the results into a single JSON response with multiple sections (one section for each product type).
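The consolidation step described above can be sketched as follows. This is only an illustration, assuming a single search request is sent across all product indices and that each hit in the response carries the `_index` it came from. The index names (`products-books`, `products-music`) and the sample response are made up for the example.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_hits_by_index(response: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Split the hits of one multi-index search response into one section per index."""
    sections: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for hit in response.get("hits", {}).get("hits", []):
        # Every hit names the index it was found in, so no per-index query is needed.
        sections[hit["_index"]].append(hit["_source"])
    return dict(sections)


# Hypothetical response shape for a search over two product indices.
sample_response = {
    "hits": {
        "hits": [
            {"_index": "products-books", "_id": "1", "_source": {"id": 1, "title": "Dune"}},
            {"_index": "products-music", "_id": "7", "_source": {"id": 7, "album": "Kind of Blue"}},
            {"_index": "products-books", "_id": "2", "_source": {"id": 2, "title": "Emma"}},
        ]
    }
}

sections = group_hits_by_index(sample_response)
```

Grouping on `_index` keeps the search itself to one request; whether that is preferable to one request per index depends on how the sections need to be ranked and paged.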

For updates, we plan on using the bulk API for update, insert and delete, and given the limitations around REST API calls, we will have to make these calls in batches of x MB each.
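As a rough sketch of the batching idea, the snippet below builds newline-delimited bulk entries and packs them into payloads that stay under a byte limit. The helper names (`bulk_entry`, `batch_bulk_payloads`), the index names and the limit are assumptions for illustration; actually sending each payload to the `_bulk` endpoint is left out.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List, Optional


def bulk_entry(op: str, index: str, doc_id: str, body: Optional[Dict[str, Any]] = None) -> str:
    """Return the NDJSON lines for one bulk operation (index, update or delete)."""
    action = json.dumps({op: {"_index": index, "_id": doc_id}})
    if op == "delete":
        return action + "\n"          # delete has an action line only
    if op == "update":
        body = {"doc": body}          # partial update wraps the changed fields in "doc"
    return action + "\n" + json.dumps(body) + "\n"


def batch_bulk_payloads(entries: Iterable[str], max_bytes: int) -> Iterator[str]:
    """Pack entries into payloads of at most max_bytes, never splitting an entry."""
    batch: List[str] = []
    size = 0
    for entry in entries:
        entry_size = len(entry.encode("utf-8"))
        if batch and size + entry_size > max_bytes:
            yield "".join(batch)
            batch, size = [], 0
        # An entry larger than max_bytes still goes out, alone in its own payload.
        batch.append(entry)
        size += entry_size
    if batch:
        yield "".join(batch)


entries = [
    bulk_entry("index", "products-books", "1", {"title": "Dune", "price": 9.5}),
    bulk_entry("update", "products-music", "7", {"price": 12.0}),
    bulk_entry("delete", "products-books", "2"),
]
payloads = list(batch_bulk_payloads(entries, max_bytes=5 * 1024 * 1024))
```

Measuring the encoded size rather than counting documents keeps each request under the size limit even when record sizes vary widely, which seems likely with feeds like the ones described.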

Question:

Is this the best way to organize the data and update it in ElasticSearch (for my use-case)? Would it be better to use multiple types on the same index (example: /products containing products of type typeA, typeB, etc.) instead of creating one index per product type? If so, will the search be significantly faster than searching across indices? Are there better ways to handle the CRUD of records after the index has been created?

Thanks in advance!

Solution

First, it is worth noting that mapping types are going away: indices created in ES 6 can contain only a single type, specifying types in requests is deprecated in ES 7, and types are removed in ES 8.

Now, whether types go away or not, it is still possible to use a single index; however, you'd increase sparsity since only 30% of your fields are common, and that should be avoided at all costs.

So, I'd say that your multi-index approach is the only one that makes sense given the nature of your data.
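To put a rough number on the sparsity argument, here is a small back-of-the-envelope sketch. It assumes three product types with 10 fields each, 3 of them shared (the 30% from the question), and an equal number of documents per type; the field names are invented.

```python
from typing import Dict, Set


def fill_ratio(fields_by_type: Dict[str, Set[str]]) -> float:
    """Fraction of field slots that hold a value if all types share one index.

    Assumes every type contributes the same number of documents.
    """
    all_fields = set().union(*fields_by_type.values())
    filled = sum(len(fields) for fields in fields_by_type.values())
    return filled / (len(all_fields) * len(fields_by_type))


common = {"id", "price", "name"}
fields_by_type = {
    product_type: common | {f"{product_type}_field_{i}" for i in range(7)}
    for product_type in ("books", "music", "games")
}

ratio = fill_ratio(fields_by_type)  # 30 filled slots out of 24 fields * 3 types
```

Under these assumptions each document populates only 10 of the 24 fields in the combined mapping, and the share drops further as more product types are added.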

Additional information worth reading: https://www.elastic.co/guide/en/elasticsearch/reference/master/removal-of-types.html
