Cassandra batch query performance on tables having different partition keys
I have a test case in which I receive 150k requests per second from a client.
My test case requires inserting an UNLOGGED batch
into multiple tables that have different partition keys:
BEGIN UNLOGGED BATCH
UPDATE kspace.count_table SET counter = counter + 1 WHERE source_id = 1 AND name = 'source_name' AND pname = 'Country' AND ptype = 'text' AND date = '2017-03-20' AND pvalue = textAsBlob('US');
UPDATE kspace.count_table SET counter = counter + 1 WHERE source_id = 1 AND name = 'source_name' AND pname = 'City' AND ptype = 'text' AND date = '2017-03-20' AND pvalue = textAsBlob('Dallas');
UPDATE kspace.count_table SET counter = counter + 1 WHERE source_id = 1 AND name = 'source_name' AND pname = 'State' AND ptype = 'text' AND date = '2017-03-20' AND pvalue = textAsBlob('Texas');
UPDATE kspace.count_table SET counter = counter + 1 WHERE source_id = 1 AND name = 'source_name' AND pname = 'SSN' AND ptype = 'text' AND date = '2017-03-20' AND pvalue = decimalAsBlob(000000000);
UPDATE kspace.count_table SET counter = counter + 1 WHERE source_id = 1 AND name = 'source_name' AND pname = 'Gender' AND ptype = 'text' AND date = '2017-03-20' AND pvalue = textAsBlob('Female');
APPLY BATCH;
Is there a better way than the approach I'm currently following?
Currently, I am batch-inserting into multiple tables whose rows may live on different nodes, since they have different partition keys, and as far as I know, batching queries across different tables with different partition keys carries an extra tradeoff.
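One alternative worth considering (a sketch, not a definitive answer) is to drop the multi-partition batch entirely and send each counter update as its own asynchronous statement with the DataStax Python driver. The connection details below are hypothetical placeholders, and the `build_params` helper is something I introduce here for illustration; the statement-building part is plain Python and mirrors the five updates in the question.

```python
# Sketch: issue each counter update as an individual async statement instead
# of one multi-partition UNLOGGED batch. The cluster connection lines are
# commented out because they require a live Cassandra cluster.

UPDATE_CQL = (
    "UPDATE kspace.count_table SET counter = counter + 1 "
    "WHERE source_id = ? AND name = ? AND pname = ? "
    "AND ptype = ? AND date = ? AND pvalue = ?"
)

def build_params(source_id, name, day, fields):
    """Build one bind-parameter tuple per (pname, pvalue) pair."""
    return [(source_id, name, pname, "text", day, pvalue)
            for pname, pvalue in fields]

# The text-valued fields from the question's batch.
params = build_params(1, "source_name", "2017-03-20",
                      [("Country", b"US"), ("City", b"Dallas"),
                       ("State", b"Texas"), ("Gender", b"Female")])

# With a live cluster (hypothetical contact point):
# from cassandra.cluster import Cluster
# session = Cluster(["127.0.0.1"]).connect()
# prepared = session.prepare(UPDATE_CQL)
# futures = [session.execute_async(prepared, p) for p in params]
# for f in futures:
#     f.result()  # block here only to surface any write errors
```

With a token-aware load-balancing policy, the driver routes each prepared statement directly to a replica that owns its partition, so no single coordinator has to fan out the whole group.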
First, it is important to understand the intended use cases for batches.
Batches are often mistakenly used in an attempt to optimize performance.
Batches are used to maintain data consistency among multiple tables. If atomicity is needed, use a logged batch. In your case this is a counter table, and if the counts across tables do not need to be consistent, do not use a batch at all. If your cluster is healthy, Cassandra will make sure all the individual writes succeed.
Unlogged batches require the coordinator to manage all of the inserts, which can place a heavy load on the coordinator node. If other nodes own the partition keys, the coordinator needs an extra network hop per statement, resulting in inefficient delivery. Use unlogged batches only when making updates to the same partition key.
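Since the advice above is to batch only statements that share a partition key, one client-side approach (a sketch I am adding, not part of the original answer) is to bucket the updates by partition key first, then send each bucket as its own single-partition unlogged batch or as individual statements. The partition-key columns of `count_table` are not shown in the question, so the `(source_id, name, pname)` key below is an assumption; substitute your table's actual partition key.

```python
from collections import defaultdict

def group_by_partition(updates, partition_key):
    """Bucket update rows by partition key so each bucket can be sent as a
    single-partition UNLOGGED batch (cheap for the coordinator to route)."""
    buckets = defaultdict(list)
    for row in updates:
        buckets[partition_key(row)].append(row)
    return dict(buckets)

# Hypothetical rows modeled on the question's counter updates.
updates = [
    {"source_id": 1, "name": "source_name", "pname": "Country", "pvalue": "US"},
    {"source_id": 1, "name": "source_name", "pname": "City", "pvalue": "Dallas"},
    {"source_id": 1, "name": "source_name", "pname": "Country", "pvalue": "FR"},
]

# Assumption: (source_id, name, pname) is the partition key -- adjust to your schema.
buckets = group_by_partition(
    updates, lambda r: (r["source_id"], r["name"], r["pname"]))
```

Here the two Country updates land in one bucket and the City update in another, so each resulting batch touches exactly one partition.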
Please see the article below:
https://docs.datastax.com/en/cql/3.1/cql/cql_using/useBatch.html