Data modeling in Cassandra with columns that can be text or numbers


Question

My table has 5 columns.

    1. ID - a number, but it can be stored as text or as a number
    2. name - text
    3. date - a date value, but it can be stored as a date or as text
    4. time - a number, but it can be stored as text or as a number
    5. rating - a number, but it can be stored as text or as a number

I want to find out which data types will make writes to my table faster. How can I find out? Is there a cassandra-stress YAML profile for this?

Recommended answer

Regarding the answer that @BryceAtNetwork23 provided, it will be the same with Cassandra 2.1 or Cassandra 2.2 (but Cassandra 3.0 will probably be a different story, as the team is currently rewriting the storage engine; see CASSANDRA-8099). Data is still stored in binary form.

However, there is more to say. You may want to consider the actual data being stored and the performance your project needs to achieve, such as queries per second.

Depending on these goals or constraints, an interesting approach is to look at the size of the serialized data for a given type in Cassandra.


  • If the data is a number, for example a long in Java, which is 8 bytes, it matches the size of the Cassandra bigint type, so serialization costs almost nothing: a plain copy will do. A key this small also puts little pressure on the Cassandra key cache.

  • If the data is a piece of text, for example a String in Java, it is encoded as UTF-16 at runtime but serialized as UTF-8 when stored with the Cassandra text type. UTF-16 uses 2 bytes for most characters and 4 bytes for the rest, whereas UTF-8 uses 1, 2, 3 or 4 bytes depending on the character, which makes it more space efficient for mostly-ASCII text.

This means CPU work is needed to encode and decode such data. The size also depends on the text: a value such as 158786464563 is stored in 12 bytes as text, against 8 bytes as a bigint. That means more space is used, and more IO as well.

Note that Cassandra also offers the ascii type, which follows the US-ASCII character set and always uses 1 byte per character.

  • If the data is a UUID (a 128-bit value), the Java UUID type uses 2 longs, so it is 16 bytes long, and Cassandra stores it as 16 bytes as well (it uses the Java UUID type).
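The sizes discussed above can be checked from plain Java, without a running cluster. The following is a minimal sketch that only measures the byte lengths the answer describes; the class and method names are made up for illustration, and it assumes the on-disk value sizes match these encodings (8-byte long, UTF-8 text, 16-byte UUID), ignoring any per-cell overhead Cassandra adds.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper: measures value sizes only, not Cassandra's storage overhead.
class SerializedSizes {
    // A bigint value is a fixed 8-byte long, whatever the number is.
    static int bigintSize(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array().length;
    }

    // A text value is the UTF-8 encoding of the string (1 to 4 bytes per character).
    static int textSize(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    // In-memory payload of a Java String counted as UTF-16 code units (2 bytes each).
    static int utf16Size(String value) {
        return value.length() * Character.BYTES;
    }

    // A UUID is two longs, so 16 bytes.
    static int uuidSize(UUID value) {
        return ByteBuffer.allocate(2 * Long.BYTES)
                .putLong(value.getMostSignificantBits())
                .putLong(value.getLeastSignificantBits())
                .array().length;
    }

    public static void main(String[] args) {
        System.out.println("bigint: " + bigintSize(158786464563L));   // 8
        System.out.println("text:   " + textSize("158786464563"));    // 12
        System.out.println("utf16:  " + utf16Size("158786464563"));   // 24
        System.out.println("uuid:   " + uuidSize(UUID.randomUUID())); // 16
    }
}
```

The same 12-digit ID therefore costs 8 bytes as a bigint and 12 bytes as text, and the gap widens as the numbers get longer.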

Again, it always depends on your project: its goals and its existing constraints. But here is my uneducated opinion:


  • If the data to be inserted is always a number inside the long range [−9,223,372,036,854,775,808 ; +9,223,372,036,854,775,807], I'd go for a bigint type.
  • UUID is fine.
  • If the cluster is not under heavy load (heavy being something like 100k queries per second) and space is not an issue, then text is fine; but if either is a concern, or if usage may grow, I'd avoid text for keys where possible.
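One way to check the first rule before committing to bigint is to test whether the existing IDs parse as a Java long. This is only a sketch: BigintCheck and fitsInBigint are hypothetical names, and it assumes the IDs arrive as decimal strings.

```java
// Hypothetical helper for deciding whether a text ID could be stored as bigint.
class BigintCheck {
    // True when the text is a decimal integer inside the Java long range,
    // which is the same range as the one quoted in the list above.
    static boolean fitsInBigint(String raw) {
        try {
            Long.parseLong(raw.trim());
            return true;
        } catch (NumberFormatException e) {
            // Not a number, or a number outside the long range.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(fitsInBigint("158786464563"));        // true
        System.out.println(fitsInBigint("9223372036854775808")); // false, one past Long.MAX_VALUE
    }
}
```

If any sampled ID fails the check, text (or a blob) remains the safer choice for that column.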

Another option is to use the blob type, i.e. a binary type, where you can lay out the data any way you want according to the needs of your software. This can give space-efficient and IO-efficient storage, and can be CPU-efficient as well. But depending on your needs, you may have to manage a lot of things in the client code, such as ordering, serialization, comparison and mapping.
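To give an idea of what managing a blob in client code involves, here is a minimal sketch that packs three of the numeric columns from the question into a fixed 13-byte layout. The layout (8-byte id, 4-byte time, 1-byte rating) and the PackedRow name are assumptions made for illustration, not something the question specifies.

```java
import java.nio.ByteBuffer;

// Hypothetical fixed layout: [id: 8 bytes][time: 4 bytes][rating: 1 byte].
class PackedRow {
    static final int SIZE = Long.BYTES + Integer.BYTES + Byte.BYTES; // 13 bytes

    // Serializes the three values into a buffer ready to be bound to a blob column.
    static ByteBuffer encode(long id, int time, byte rating) {
        ByteBuffer buf = ByteBuffer.allocate(SIZE);
        buf.putLong(id).putInt(time).put(rating);
        buf.flip(); // position 0, limit SIZE: ready for reading
        return buf;
    }

    // Absolute reads, so decoding does not move the buffer position.
    static long id(ByteBuffer blob) {
        return blob.getLong(blob.position());
    }

    static int time(ByteBuffer blob) {
        return blob.getInt(blob.position() + Long.BYTES);
    }

    static byte rating(ByteBuffer blob) {
        return blob.get(blob.position() + Long.BYTES + Integer.BYTES);
    }

    public static void main(String[] args) {
        ByteBuffer blob = encode(158786464563L, 1530, (byte) 5);
        System.out.println(blob.remaining() + " bytes, id=" + id(blob)
                + ", time=" + time(blob) + ", rating=" + rating(blob));
    }
}
```

The trade-off is visible even in this small example: the storage is compact, but Cassandra can no longer validate or interpret the individual fields, so every reader must know the layout.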
