Data structure design for database replication support


Problem Description

I'm developing a small ad-hoc replication framework (for a retail company), which replicates only certain records of certain tables (about 200) depending on specific domain-level logic.

In order to know the replication status of each record for each target host, I have a repStatus character(NUMBER_OF_HOSTS) column, where each host always corresponds to the same position.

The values in each position of this column could be 0 (no action), 1 (replicate record), 2 (record replicated), 3 (resend after confirmation), A (first time error), B (second time error)... etc.

For example, 012A means:

  • do not send anything to host 1 (position 1 = 0)
  • send this record to host 2 (position 2 = 1)
  • this record has already been replicated to host 3 (position 3 = 2)
  • the first replication attempt for host 4 failed (position 4 = A)

This looks pretty easy and simple, and has a "straightforward reading": in order to know the status of a record, I simply read the repStatus column.

However, it looks like this approach leads to performance problems when the application has to look for the target records to replicate.
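The root of the problem can be seen in the kind of lookup this encoding forces. A sketch, assuming the column lives on each replicated table (here called t) and host 4 occupies the fourth character position:

```sql
-- find all records that must be sent to host 4:
-- substr() on repStatus defeats an ordinary b-tree index,
-- so this has to scan every row of t
SELECT *
FROM t
WHERE substr(repStatus, 4, 1) = '1';
```

Without an expression index per host position, every such query is a sequential scan.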

So I'm sure there's a better design to solve this problem performance-wise. Maybe an additional table referencing table, record and host could be a solution:

CREATE TABLE repStatus (tableID int, recordID int, targetHostID int, status int);

Where the status value could now even be normalized into a new table. However, 200 tables * ~500,000 records per table might be a considerably large number of rows to handle in a single table anyway.
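With such a table, finding the work for one host becomes a plain equality lookup that an index can serve. A sketch, where the numeric status codes mirror the original per-character encoding (e.g. 1 = replicate record):

```sql
-- pending replications for host 2, served by an index on
-- (targetHostID, status) instead of a per-row string parse
SELECT tableID, recordID
FROM repStatus
WHERE targetHostID = 2
  AND status = 1;
```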

Any alternative based on experience is welcome.

Recommended Answer

Well, the first thing I would do is to drop the icky string parsing everywhere and replace it with PostgreSQL native types. To store the replication status on each record similar to your current solution:

CREATE TYPE replication_status AS ENUM (
  'no_action',
  'replicate_record',
  'record_replicated',
  'error_1',
  'error_2',
  'error_3'
  );
ALTER TABLE t ADD COLUMN rep_status_array replication_status[];

This costs you a little more storage space -- enum values are 4 bytes instead of 1 and arrays have some overhead. However, by teaching the database your concepts instead of hiding them, you can write things like:

-- find all records that need to be replicated to host 4
SELECT * FROM t WHERE rep_status_array[4] = 'replicate_record';

-- find all records that contain any error status
SELECT * FROM t WHERE rep_status_array &&
  ARRAY['error_1', 'error_2', 'error_3']::replication_status[];
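Updating a single host's slot is just an array-element assignment. A sketch, assuming t has an id primary key column:

```sql
-- after successfully sending record 42 to host 4,
-- flip its slot from 'replicate_record' to 'record_replicated'
UPDATE t
SET rep_status_array[4] = 'record_replicated'
WHERE id = 42;
```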

You can put a GIN index right on rep_status_array if that helps your use case, but it's better to look at your queries and create indexes specifically for what you use:

CREATE INDEX t_replication_host_4_key ON t ((rep_status_array[4]));
CREATE INDEX t_replication_error_key ON t (id)
  WHERE rep_status_array && ARRAY['error_1', 'error_2', 'error_3']::replication_status[];

That said, given 200 tables, I'd be tempted to split this out into a single replication status table -- either one row with an array of statuses or one row per host, depending on how the rest of the replication logic works. I'd still use that enumeration:

CREATE TABLE adhoc_replication (
  record_id bigint not null,
  table_oid oid not null,
  host_id integer not null,
  status replication_status not null default 'no_action',
  primary key (record_id,table_oid,host_id)
  );
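Queueing a record for replication then becomes a single upsert into this table. A sketch: "orders" and record 42 are made-up examples, and the ON CONFLICT clause requires PostgreSQL 9.5 or later (on older versions you'd do a separate UPDATE-then-INSERT):

```sql
-- mark record 42 of table "orders" for replication to host 4;
-- 'orders'::regclass::oid resolves the table name to its OID
INSERT INTO adhoc_replication (record_id, table_oid, host_id, status)
VALUES (42, 'orders'::regclass::oid, 4, 'replicate_record')
ON CONFLICT (record_id, table_oid, host_id)
  DO UPDATE SET status = 'replicate_record';
```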

PostgreSQL internally assigns each table an OID (try SELECT *, tableoid FROM t LIMIT 1), which is a convenient, stable numeric identifier within a single database system. Note, however, that it changes if the table is dropped and recreated (which can happen if you e.g. dump and restore the database), and for the same reason it's very likely different between development and production. If you'd rather have those situations work in exchange for breaking when you add or rename a table, use an enumeration instead of an OID.

Using a single table for all replication would allow you to easily re-use triggers and queries and such, decoupling most replication logic from the data it's replicating. It also allows you to query based on status for a given host across all your origin tables by referencing a single index, which could be important.
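For example, a replication worker for one host can pull its entire work list across all origin tables from the one status table. A sketch, assuming a supporting index on (host_id, status), since the primary key above leads with record_id:

```sql
-- a hypothetical index to serve per-host lookups:
-- CREATE INDEX adhoc_replication_host_key
--   ON adhoc_replication (host_id, status);

-- everything host 4 still needs, across all 200 origin tables
SELECT table_oid::regclass AS origin_table, record_id
FROM adhoc_replication
WHERE host_id = 4
  AND status = 'replicate_record';
```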

As for table size, PostgreSQL can definitely handle 10 million rows in the same table. If you went with a dedicated replication-related table, you could always partition per host. (Partitioning by table makes little sense to me; it seems worse than storing replication status on each upstream row.) Which way to partition or whether or not it's appropriate at all depends entirely on what kind of questions you intend to ask your database, and what sort of activity happens on the base tables. (Partitioning means maintaining many smaller blobs instead of a few large ones, and potentially accessing many smaller blobs to perform a single operation.) It's really a matter of choosing when you want your disk seeks to happen.
