Cassandra - 一张大桌子与多张桌子 [英] Cassandra - one big table vs many tables

查看:20
本文介绍了Cassandra - 一张大桌子与多张桌子的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前正在试用 Cassandra 数据库.我正在使用 DataStax 开发中心和 DataStax C# 驱动程序.

I'm currently looking trying out Cassandra database. I'm using DataStax Dev center and DataStax C# driver.

我当前的模型非常简单,仅包含:

My Current model is quite simple and consists of only:

  • ParameterId (int) - 将用作表的 id.
  • 值(bigint)
  • MeasureTime(时间戳)

我将有 1000 个(不多也不少)的参数,从 1 到 1000.并且将获得每个参数的条目一次.第二个,将运行多年.

I will be having 1000 (no more, no less) parameters, from 1 - 1000. And will be getting an entry for each parameter once pr. second and will be running for years.

我的问题是关于创建表格是否更好的做法:

My question is regarding whether it is better practice to create a table as:

CREATE TABLE keyspace.measurement (
    parameterId int,
    value bigint,
    measureTime timestamp,
    PRIMARY KEY(parameterId, measureTime)
) WITH CLUSTERING ORDER BY (measureTime DESC)

或者创建 1000 个仅包含一个值和 measureTime 的表会更好,如果是这样,我是否能够在我的 MeasureTime 上进行范围查询?

Or it would be better to create 1000 tables consisting only of a value and measureTime, and if so would I be able to range query on my MeasureTime?

推荐答案

你会遇到很宽的行.我会建议不要使用您的表格格式,并且我会使用一些可以让您控制行宽的东西.

You are going to hit very wide rows with this. I would advise against your table format, and I'd go with something that allows you to control the wideness of the rows.

根据您的查询要求,我会为您写下更合适的架构(恕我直言):

Depending on your query requirements, I'll write you down a more suitable schema (IMHO):

CREATE TABLE keyspace.measurement (
    parameterId int,
    granularity timestamp,
    value bigint,
    measureTime timestamp,
    PRIMARY KEY((parameterId, granularity), measureTime)
) WITH CLUSTERING ORDER BY (measureTime DESC)

这与您的非常相似,但它有一个主要优点:您可以配置行的宽度,并且没有任何热点.这个想法非常简单:parameterIdgranularity 字段都构成 partition key,所以它们告诉你的数据去哪里,而 measureTime 将使您的数据保持有序.假设您想按天查询,您可以将 measureTime 的值 yyyy-mm-dd 存储到 granularity 中>,将同一天的所有度量组合在一起.

This is very similar to yours, however it has a major advantage: you can configure the wideness of your rows, and you don't have any hotspots. The idea is dead simple: both parameterId and granularity fields make the partition key, so they tell where your data will go, while measureTime will keep your data ordered. Supposing you want to query on a day-by-day basis, you'd store into granularity the value yyyy-mm-dd of your measureTime, grouping together all the measures of the same day.

这使您可以通过有效的范围查询检索位于同一分区上的所有值(根据给定的 parameterIdgranularity 字段对).在日常配置中,每个分区最终会有 86400 条记录.这个数字可能仍然很高(建议的限制是 10k IIRC),您可以通过使用 yyyy-mm-dd HH:00 值进行每小时分组来降低该值.

This allows you to retrieve all the values lying on the same partition (so per given parameterId and granularity fields pair) with an efficient range query. In a day-by-day configuration, you'd end up with 86400 records per partition. This number could be still high (the suggested limit is 10k IIRC), and you can lower tht value by going on hour-by-hour grouping with yyyy-mm-dd HH:00 value instead.

这种方法的缺点是,如果您需要来自多个分区的数据(例如,您正在逐日分组,但您需要连续两天的数据,例如 1 月 19 日的最后 6 小时,以及1 月 20 日的前 6 小时),那么您需要执行多个查询.

The drawback of that approach is that if you need data from multiple partitions (eg you are grouping on day-by-day basis, but you need data for two consecutive days, eg the last 6 hours of the Jan 19th, and the first 6 hours of Jan 20th), then you'll need to perform multiple queries.

这篇关于Cassandra - 一张大桌子与多张桌子的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆