Elasticsearch replication of other system data?


Question

Suppose I want to use elasticsearch to implement a generic search on a website. The top search bar would be expected to find resources of all different kinds across the site. Documents for sure (uploaded/indexed via Tika) but also things like clients, accounts, other people, etc.

For architectural reasons, most of the non-document stuff (clients, accounts) will exist in a relational database.

When implementing this search, option #1 would be to create document versions of everything, and then just use elasticsearch to run all aspects of the search, relying not at all on the relational database for finding different types of objects.

选项#2是使用elasticsearch仅用于索引的文件,这意味着对于一般的网站搜索功能,你不得不多次搜索外包给多个系统,然后在返回之前聚集的结果。

Option #2 would be to use elasticsearch only for indexing the documents, which would mean for a general "site search" feature, you'd have to farm out multiple searches to multiple systems, then aggregate the results before returning them.
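To make the fan-out-and-aggregate idea concrete, here is a minimal sketch of option #2. The two backend functions are hypothetical stand-ins for a real Elasticsearch query and a real SQL query; only the fan-out/merge shape is the point:

```python
# Sketch of option #2: fan one query out to several backends and merge.
# Both backend functions below are made-up stand-ins for real
# Elasticsearch and relational-database queries.

def search_documents(query):
    # Stand-in for an Elasticsearch full-text query over uploaded documents.
    docs = [("invoice.pdf", 0.9), ("contract.pdf", 0.4)]
    return [(score, "document", name) for name, score in docs if query in name]

def search_clients(query):
    # Stand-in for a SQL LIKE / full-text query against the relational DB.
    clients = [("Acme Contracting", 0.7)]
    return [(score, "client", name)
            for name, score in clients if query.lower() in name.lower()]

def site_search(query):
    # Aggregate results from every backend, then sort by score so the
    # caller sees one unified ranking regardless of the source system.
    results = search_documents(query) + search_clients(query)
    return sorted(results, reverse=True)

if __name__ == "__main__":
    for score, kind, name in site_search("contract"):
        print(f"{score:.1f}  {kind:<8}  {name}")
```

Note that the scores coming back from different systems are generally not comparable (Elasticsearch relevance scores vs. whatever the SQL side produces), which is part of why this option is awkward in practice.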

Option #1 seems far superior, but the downside is that it requires that elastic search in essence have a copy of a great many things in the production relational database, plus that those copies be kept fresh as things change.

What's the best option for keeping these stores in sync, and am I correct in thinking that for general search, option #1 is superior? Is there an option #3?

Answer

You've pretty much listed the two main options there are when it comes to search across multiple data stores, i.e. search in one central data store (option #1) or search in all data stores and aggregate the results (option #2).

Both options would work, although option #2 has two main drawbacks:


  1. It requires a lot of logic in your application in order to farm out the search to the multiple data stores and to aggregate the results you get back.

  2. The response times are likely to differ from one data store to the next, so you'll have to wait for the slowest data store before presenting the search results to the user, unless you circumvent this using asynchronous techniques (Ajax, WebSockets, etc.).
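The second drawback can be softened by running the backend searches concurrently under a deadline, so a slow store simply misses the bus instead of blocking the page. A minimal asyncio sketch, with made-up store names and latencies:

```python
import asyncio

# Hypothetical per-store search coroutines with different latencies.
async def search_store(name, delay, hits):
    await asyncio.sleep(delay)  # simulates the store's response time
    return name, hits

async def site_search(timeout=0.5):
    # Fire all backend searches concurrently and keep whatever finished
    # within the deadline; anything still pending is cancelled.
    tasks = [
        asyncio.create_task(search_store("elasticsearch", 0.1, ["doc-1"])),
        asyncio.create_task(search_store("crm-db", 2.0, ["client-7"])),
    ]
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()
    return dict(task.result() for task in done)

results = asyncio.run(site_search())
```

Here the slow `crm-db` store (2.0 s) blows the 0.5 s deadline, so only the Elasticsearch hits make it into `results`.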

If you want to provide a better and more reliable search experience, option #1 would clearly get my vote (I go this route most of the time, actually). As you've correctly stated, the main "drawback" of this option is that you need to keep Elasticsearch in sync with the changes in your other master data stores.

Since your other data stores will be relational databases, you have a few different options to keep them in sync with Elasticsearch, namely:

  • using the Logstash JDBC input
  • using the JDBC importer tool
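Both tools boil down to the same polling loop: periodically select the rows whose timestamp is newer than a checkpoint and upsert them into the index. The sketch below shows that loop, with sqlite3 and an in-memory dict standing in for the real database and for Elasticsearch (table and column names are made up):

```python
import sqlite3

# Sketch of what the Logstash JDBC input / JDBC importer do under the
# hood: poll for rows changed since the last run and upsert them.
# sqlite3 and the `index` dict stand in for the real DB and for ES.

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clients (id INTEGER PRIMARY KEY,"
           " name TEXT, updated_at INTEGER)")
db.executemany("INSERT INTO clients VALUES (?, ?, ?)",
               [(1, "Acme", 100), (2, "Globex", 200)])

index = {}          # stands in for an Elasticsearch index
last_run = 0        # checkpoint persisted between polling runs

def poll():
    global last_run
    rows = db.execute(
        "SELECT id, name, updated_at FROM clients WHERE updated_at > ?",
        (last_run,)).fetchall()
    for row_id, name, updated_at in rows:
        index[row_id] = {"name": name}   # upsert, like an ES index op
        last_run = max(last_run, updated_at)

poll()   # first run picks up both rows
db.execute("UPDATE clients SET name = 'Acme Corp',"
           " updated_at = 300 WHERE id = 1")
poll()   # second run sees only the changed row
```

The checkpoint (`last_run` here, `sql_last_value` in Logstash terms) is what keeps each run incremental instead of a full re-import.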

These first two options work great but have one main disadvantage, i.e. they don't capture DELETEs on your table, they will only capture INSERTs and UPDATEs. This means that if you ever delete a user, account, etc, you will not be able to know that you have to delete the corresponding document in Elasticsearch. Unless, of course, you decide to delete the Elasticsearch index before each import session.
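Short of wiping the whole index before each import, one workaround for the missing DELETEs is a periodic reconciliation pass that diffs the primary keys in the database against the document IDs in the index. A toy sketch, with plain sets standing in for both systems:

```python
# Toy reconciliation pass: any ID present in the index but absent from
# the source table must have been deleted, so drop it from the index
# too. Plain sets stand in for the database and for Elasticsearch.

db_ids = {1, 2, 4}            # primary keys currently in the DB table
indexed_ids = {1, 2, 3, 4}    # document IDs currently in the index

stale = indexed_ids - db_ids  # deleted in the DB but still indexed
for doc_id in stale:
    indexed_ids.discard(doc_id)   # would be a delete call against ES
```

This only works if you can cheaply enumerate both ID sets, which is usually fine for tables of clients/accounts but may not scale to very large indices.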

To alleviate this, you can use another tool which bases itself on the MySQL binlog and will thus be able to capture every event. There's one written in Go and one in Python.
