How does a multi-table schema create data consistency issues?


Problem description

As per this answer, it is recommended to go for single table in Cassandra.

Cassandra 3.0


We are planning the following schema:


The second table has a composite key, PK(domain_id, item_id), so domain_id is the partition key and item_id is the clustering key.
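To make the two key roles concrete, here is a minimal Python sketch (plain Python, not Cassandra code) of how rows under PK(domain_id, item_id) are organized: domain_id decides which partition a row belongs to, and item_id orders the rows inside that partition. The sample rows and the `build_partitions` helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def build_partitions(rows):
    """Group rows by partition key (domain_id), sorted by clustering key (item_id)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["domain_id"]].append(row)
    for domain_rows in partitions.values():
        # Within a partition, rows are kept in clustering-key order.
        domain_rows.sort(key=lambda r: r["item_id"])
    return dict(partitions)


# Hypothetical sample data, not taken from the question.
rows = [
    {"domain_id": "d1", "item_id": 3},
    {"domain_id": "d2", "item_id": 1},
    {"domain_id": "d1", "item_id": 1},
]
partitions = build_partitions(rows)
print([r["item_id"] for r in partitions["d1"]])
```

A query that supplies domain_id touches a single partition, which is why reads by partition key stay cheap.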

The GET request handler will read from both tables.

The POST request handler will write to both tables.

The PUT request handler will write to the details table only.


As per CAP theorem,

  1. What are the consistency issues in having multi-table schema? in Cassandra...

  2. Can we avoid consistency issues in Cassandra? with these terms QUORUM, consistency level etc...

Solution

"recommended to go for single table in Cassandra."

I would recommend the opposite. If you have to support multiple queries for the same data in Apache Cassandra, you should have one table for each query.

"What are the consistency issues in having multi-table schema? in Cassandra..."

Consistency issues between query tables can happen when a write is applied to one table but not the other(s). In that case, the application should have a way to handle it gracefully. If it becomes problematic, running a nightly job to bring the tables back in sync might be necessary.
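As a rough illustration of what such a nightly job could check, the sketch below compares the domain_ids seen in the two tables and reports the ones missing from the lookup table. It is plain Python over in-memory lists; the `find_missing_ids` helper and the sample ids are hypothetical, and how the ids are actually read out of Cassandra is left out.

```python
def find_missing_ids(details_domain_ids, id_table_domain_ids):
    """Return domain_ids present in the details table but absent from the id table."""
    return sorted(set(details_domain_ids) - set(id_table_domain_ids))


# Hypothetical snapshots of the domain_ids found in each table.
details_ids = ["d1", "d2", "d3"]
id_table_ids = ["d1", "d3"]

missing = find_missing_ids(details_ids, id_table_ids)
print(missing)  # ids the job would need to re-insert into the id table
```

The job would then rewrite the missing rows; since Cassandra inserts are upserts, re-applying one that already landed is harmless.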

You can also have consistency issues within a table. Maybe something happens during the write process: a node crashes, a node stays down longer than the hint window (3 hours by default), or hints are not replayed. In that case, a given data point may exist on only a subset of its intended replicas.

This scenario can be countered by running regularly scheduled repairs. Additionally, consistency can be increased on a per-query basis (QUORUM vs. ONE, etc.), and consistency levels of QUORUM and higher will occasionally trigger a read repair (which syncs all replicas involved in the current operation).
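The effect of the consistency level can be checked with simple arithmetic: a read is guaranteed to overlap with the latest successful write when the number of replicas read plus the number of replicas written exceeds the replication factor. The sketch below assumes a single datacenter and a replication factor of 3, which the question does not state; the function names are illustrative only.

```python
def quorum(replication_factor):
    """Number of replicas that make up a quorum."""
    return replication_factor // 2 + 1


def read_overlaps_write(read_replicas, write_replicas, replication_factor):
    """True when every read set must share at least one replica with every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3  # assumed replication factor
q = quorum(rf)
print(q)
print(read_overlaps_write(q, q, rf))  # QUORUM writes + QUORUM reads
print(read_overlaps_write(1, 1, rf))  # ONE writes + ONE reads
```

Note that this only covers replicas of a single table. It does nothing for the case where a write reaches one query table but never reaches the other.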

"Can we avoid consistency issues in Cassandra? with these terms QUORUM, consistency level etc..."

So Apache Cassandra was engineered to be highly available (HA), thereby embracing the paradigm of eventual consistency. Some might interpret that to mean Cassandra is inconsistent by design, and they would not be incorrect. After several years of supporting hundreds of clusters at web/retail scale, I can say that consistency issues (while they do happen) are rare, and are usually caused by failures of components outside of the Cassandra cluster.

Ultimately though, it comes down to the business requirements of the application. For some applications like product reviews or recommendations, a little inconsistency shouldn't be a problem. On the other hand, things like location-based pricing may need a higher level of query consistency. And if 100% consistency is indeed a hard requirement, I would question whether or not Cassandra is the proper choice for data storage.

Edit

I did not get this: "Consistency issues between query tables can happen when writes are applied to one table but not the other(s)." When writes are applied to one table but not the other(s), what happens?

So let's say that a new domain is added. Perhaps a scenario arises where the domain_details_table gets updated, but the id_table does not. Nothing is wrong here on the database side. The problem only shows up when the application expects to find that domain_id in the id_table, but cannot.

In that case, maybe the application can retry using a secondary index on domain_details_table.domain_id. It won't be fast, but the decision to be made is about which outcome is preferable: no answer, or a slow answer? Again, application requirements come into play here.
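The "no answer or slow answer" trade-off can be sketched as a lookup with a fallback path. This is plain Python with in-memory stand-ins, not driver code: a dict plays the role of id_table, and a linear scan over a list stands in for the slower secondary-index query on domain_details_table. The `find_domain` helper and the sample rows are hypothetical.

```python
def find_domain(domain_id, id_table, details_table):
    """Try the fast lookup table first; fall back to a slow scan of the details table."""
    if domain_id in id_table:
        return id_table[domain_id], "fast"
    # Fallback: stands in for the slower secondary-index query.
    for row in details_table:
        if row["domain_id"] == domain_id:
            return row, "slow"
    return None, "missing"


# Hypothetical state: the write for d2 reached the details table only.
id_table = {"d1": {"domain_id": "d1"}}
details_table = [{"domain_id": "d1"}, {"domain_id": "d2"}]

print(find_domain("d1", id_table, details_table)[1])
print(find_domain("d2", id_table, details_table)[1])
print(find_domain("d9", id_table, details_table)[1])
```

Whether the slow path is worth having depends on how often the tables diverge and how much latency the caller can tolerate.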

For your point: "You can also have consistency issues within a table. Maybe something happens (node crashes, down longer than 3 hours, hints not replayed) during the write process." How does an RDBMS (like MySQL) deal with this?

So the answer to this used to be simple. RDBMSs only ran on a single server, so there was only one replica to keep in sync. But today, most RDBMSs have HA solutions which can be used, and those replicas have to be kept in sync. In that case (from what I understand), most of them will asynchronously update the secondary replica(s), while restricting traffic to the primary only.

It's also good to remember that RDBMSs enforce consistency through locking strategies, as well. So even a single-instance RDBMS will lock a data point during an update, blocking any reads until the lock is released.

In a node-down scenario, a single-instance RDBMS will be completely offline, so instead of inconsistent data you'd have data loss. In an HA RDBMS scenario, there would be a short pause (during which you would likely encounter connection/query failures) until it has failed over to the new primary. Once the failed node comes back up, additional time would probably be necessary to sync up the replicas before HA can be restored.
