卡桑德拉-一张大桌子vs多张桌子 [英] Cassandra - one big table vs many tables

查看:44
本文介绍了卡桑德拉-一张大桌子vs多张桌子的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前正在尝试试用Cassandra数据库。
我正在使用DataStax开发中心和DataStax C#驱动程序。

I'm currently looking trying out Cassandra database. I'm using DataStax Dev center and DataStax C# driver.

我的当前模型非常简单,仅包含以下内容:

My Current model is quite simple and consists of only:


  • ParameterId(int)-将用作表的ID。

  • 值(bigint)

  • MeasureTime(时间戳)

我将拥有1000个参数(不多,不少),范围是1-1000。并且一次将获得每个参数的条目。第二个并且将运行多年。

I will be having 1000 (no more, no less) parameters, from 1 - 1000. And will be getting an entry for each parameter once pr. second and will be running for years.

我的问题是关于创建表是否更好:

My question is regarding whether it is better practice to create a table as:

CREATE TABLE keyspace.measurement (
    parameterId int,
    value bigint,
    measureTime timestamp,
    PRIMARY KEY(parameterId, measureTime)
) WITH CLUSTERING ORDER BY (measureTime DESC)

或者最好创建1000表格仅由一个值和measureTime组成,如果可以的话,是否可以对MeasureTime进行范围查询?

Or it would be better to create 1000 tables consisting only of a value and measureTime, and if so would I be able to range query on my MeasureTime?

推荐答案

打算用它打很多行我不建议您使用表格格式,并且会选择允许您控制行宽的内容。

You are going to hit very wide rows with this. I would advise against your table format, and I'd go with something that allows you to control the wideness of the rows.

根据您的查询要求,我将为您写下一个更合适的架构(IMHO):

Depending on your query requirements, I'll write you down a more suitable schema (IMHO):

CREATE TABLE keyspace.measurement (
    parameterId int,
    granularity timestamp,
    value bigint,
    measureTime timestamp,
    PRIMARY KEY((parameterId, granularity), measureTime)
) WITH CLUSTERING ORDER BY (measureTime DESC)

这与您的非常相似,但是它具有一个主要优点:您可以配置行的宽度,并且没有任何热点。这个想法简直太简单了: parameterId granularity 字段都是 partition key ,所以他们说出您的数据将流向何处,而 measureTime 将使您的数据保持有序。假设您要每天查询一次,则将值 yyyy-mm-dd 粒度中您的 measureTime 中的$ c>,将同一天的所有度量组合在一起。

This is very similar to yours, however it has a major advantage: you can configure the wideness of your rows, and you don't have any hotspots. The idea is dead simple: both parameterId and granularity fields make the partition key, so they tell where your data will go, while measureTime will keep your data ordered. Supposing you want to query on a day-by-day basis, you'd store into granularity the value yyyy-mm-dd of your measureTime, grouping together all the measures of the same day.

这使您可以检索位于同一分区上的所有值(因此,每个给定的 parameterId 粒度字段对)与有效范围查询。在日常配置中,每个分区最终可以存储86400条记录。此数字可能仍然很高(建议的上限为1万IIRC),您可以通过使用 yyyy-mm-dd HH:00

This allows you to retrieve all the values lying on the same partition (so per given parameterId and granularity fields pair) with an efficient range query. In a day-by-day configuration, you'd end up with 86400 records per partition. This number could be still high (the suggested limit is 10k IIRC), and you can lower tht value by going on hour-by-hour grouping with yyyy-mm-dd HH:00 value instead.

这种方法的缺点是,如果您需要来自多个分区的数据(例如,每天分组,但您需要连续两天的数据,例如1月19日的最后6个小时和1月20日的前6个小时),则您需要执行多个查询。

The drawback of that approach is that if you need data from multiple partitions (eg you are grouping on day-by-day basis, but you need data for two consecutive days, eg the last 6 hours of the Jan 19th, and the first 6 hours of Jan 20th), then you'll need to perform multiple queries.

这篇关于卡桑德拉-一张大桌子vs多张桌子的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆