Data modeling in Cassandra with columns that can be text or numbers
Question
My table has 5 columns.
1. ID - a number, but it can be stored as text or as a number
2. name - text
3. date - a date value, but it can be stored as a date or as text
4. time - a number, but it can be stored as text or as a number
5. rating - a number, but it can be stored as text or as a number
I want to find out which data types will make writes to my table faster. How can I find out? Is there a cassandra-stress YAML profile for this?
Recommended answer
Regarding the answer that @BryceAtNetwork23 provided, it will be the same with Cassandra 2.1 or Cassandra 2.2 (but Cassandra 3.0 will probably be a different story, as the team is currently rewriting the storage engine; see CASSANDRA-8099). Data is still stored in binary.
However, there's more to say. You may want to consider the actual data being stored and the performance your project needs to achieve (queries per second, etc.).
Depending on these goals or constraints, an interesting approach is to look at the size of the serialized data for a given type in Cassandra.
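To make that comparison concrete, here is a minimal sketch in Python (used only for illustration) that adds up the raw value sizes of the numeric columns from the question, once with numeric types and once with everything stored as text. The sample values and the helper name payload_size are my own assumptions, and the numbers cover only the value payload, not Cassandra's per-cell or on-disk overhead.

```python
import struct

def payload_size(value, cql_type):
    """Bytes needed for the raw value alone (no Cassandra storage overhead)."""
    if cql_type == "bigint":
        return len(struct.pack(">q", value))  # 64-bit signed integer
    if cql_type == "int":
        return len(struct.pack(">i", value))  # 32-bit signed integer
    if cql_type == "text":
        return len(str(value).encode("utf-8"))  # UTF-8 bytes of the string form
    raise ValueError(f"unsupported type: {cql_type}")

# Hypothetical sample values for the numeric columns of the question.
sample = {"id": 158786464563, "time": 1430000000, "rating": 4}

typed_size = (payload_size(sample["id"], "bigint")
              + payload_size(sample["time"], "int")
              + payload_size(sample["rating"], "int"))
text_size = sum(payload_size(v, "text") for v in sample.values())

print("typed columns:", typed_size, "bytes")
print("text columns: ", text_size, "bytes")
```

Note that text is not always larger: a one-digit rating takes 1 byte as text but 4 bytes as an int, so it is worth running this kind of comparison on values that look like your real data.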
If the data is a number, for example a long in Java, which has a size of 8 bytes, it matches the Cassandra bigint type in size. That means there's no cost associated with serializing: a plain copy will do. This also has the benefit that the key is small enough that it doesn't stress the Cassandra key cache.
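As a rough illustration of that fixed 8-byte layout, the sketch below packs a 64-bit signed integer in big-endian order and reads it back. It is written in Python only for convenience; the function names are hypothetical and this is not Cassandra's or any driver's actual code.

```python
import struct

LONG_MIN = -2**63
LONG_MAX = 2**63 - 1

def fits_bigint(n):
    """True if n fits in a 64-bit signed integer (the range of a Java long)."""
    return LONG_MIN <= n <= LONG_MAX

def encode_bigint(n):
    """Pack n as 8 big-endian bytes; refuse values outside the long range."""
    if not fits_bigint(n):
        raise OverflowError(f"{n} does not fit in 64 bits")
    return struct.pack(">q", n)

def decode_bigint(raw):
    """Inverse of encode_bigint."""
    return struct.unpack(">q", raw)[0]

encoded = encode_bigint(158786464563)
print(len(encoded), decode_bigint(encoded))
```

Whatever the magnitude of the number, the encoded form is always 8 bytes, which is what keeps keys of this type small and predictable.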
If the data is a piece of text, for example a String in Java, it is encoded in UTF-16 in the runtime, but when serialized in Cassandra with the text type, UTF-8 is used. UTF-16 uses 2 bytes for most characters and 4 bytes for the rest (surrogate pairs), whereas UTF-8 is space efficient and, depending on the character, uses 1, 2, 3 or 4 bytes.
That means there's CPU work to encode and decode such data when serializing. Also, depending on the text, for example 158786464563, the data will be stored with 12 bytes, versus 8 bytes for the same value stored as a bigint. That means more space is used and more IO as well.
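The byte counts are easy to check. This small Python sketch (illustration only; the helper names are mine) encodes a few sample characters and the digit string from the example in both UTF-8 and UTF-16.

```python
def utf8_len(s):
    """Number of bytes of s once encoded as UTF-8."""
    return len(s.encode("utf-8"))

def utf16_len(s):
    """Number of bytes of s once encoded as UTF-16 (big-endian, no byte order mark)."""
    return len(s.encode("utf-16-be"))

# One character each from the 1, 2, 3 and 4 byte UTF-8 ranges.
for ch in ["a", "é", "€", "😀"]:
    print(repr(ch), "utf-8:", utf8_len(ch), "utf-16:", utf16_len(ch))

digits = "158786464563"
print(digits, "as text:", utf8_len(digits), "bytes")
```

For plain digits UTF-8 uses one byte per character, so the cost of storing a number as text grows with the number of digits instead of staying fixed.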
Note that Cassandra offers the ascii type, which follows the US-ASCII character set and always uses 1 byte per character.
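If you are wondering whether a given value would qualify for ascii, a quick client-side check could look like the sketch below. The function names are hypothetical, and this only checks the string itself; it says nothing about how a particular driver validates values.

```python
def can_use_ascii(s):
    """True if every character of s is in the US-ASCII range."""
    return s.isascii()

def ascii_size(s):
    """Bytes used by s as ASCII; raises UnicodeEncodeError for non-ASCII input."""
    return len(s.encode("ascii"))

print(can_use_ascii("rating-5"), ascii_size("rating-5"))
print(can_use_ascii("café"))
```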
If the data is a UUID (a 128-bit value), the Java UUID type uses 2 longs, so it is 16 bytes long, and Cassandra stores it as 16 bytes as well (it uses the Java UUID type).
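The "two longs" view can be sketched as follows, again in Python for illustration. The UUID value is an arbitrary example, and splitting it with struct simply mirrors the most/least significant halves that Java exposes; it is not Cassandra's code.

```python
import struct
import uuid

u = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")  # arbitrary example value

raw = u.bytes                              # the 128-bit value as 16 bytes
most, least = struct.unpack(">qq", raw)    # two 64-bit signed halves
rebuilt = uuid.UUID(bytes=struct.pack(">qq", most, least))

print("binary size:", len(raw))
print("size as text:", len(str(u).encode("utf-8")))
print("round trip ok:", rebuilt == u)
```

The same identifier stored as its usual hyphenated text form takes 36 bytes, more than twice the binary size.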
Again, your mileage may vary depending on your project, its goals and its existing constraints. But here are my uneducated suggestions:
- If the data to be inserted is always a number inside the long range [−9,223,372,036,854,775,808; +9,223,372,036,854,775,807], I'd go for a bigint type.
- UUID is fine.
- If the cluster is not under heavy load (like 100k queries per second) and space is not an issue, then text is not an issue; but if it is, or if usage may grow, I'd avoid text for keys if possible.
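These rules of thumb can be written down as a small helper that looks at an incoming string value and proposes a key type. This is only a sketch of the reasoning above: suggest_key_type is a hypothetical name, and it deliberately ignores load and space considerations, which you have to judge yourself.

```python
import re
import uuid

LONG_MIN = -2**63
LONG_MAX = 2**63 - 1

def suggest_key_type(value):
    """Propose a CQL type name for a string value, following the list above."""
    # Plain decimal integers go to bigint when they fit in 64 bits.
    if re.fullmatch(r"[+-]?[0-9]+", value):
        return "bigint" if LONG_MIN <= int(value) <= LONG_MAX else "text"
    # Anything that parses as a UUID can use the 16-byte uuid type.
    try:
        uuid.UUID(value)
        return "uuid"
    except ValueError:
        pass
    # Fall back to text for everything else.
    return "text"

for v in ["158786464563", "123e4567-e89b-12d3-a456-426614174000", "alice"]:
    print(v, "->", suggest_key_type(v))
```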
Another option is to use the blob type, i.e. a binary type, where you can encode any data the way you want according to the needs of the software. This could allow storage that is space efficient, IO efficient and CPU efficient as well. But depending on the needs, it may be necessary to manage a lot of things in the client code, like ordering, serialization, comparison, mapping, etc.
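To give an idea of what "managing things in the client code" means, here is a minimal sketch that packs three of the question's numeric columns into a single byte string suitable for a blob column. The layout (8-byte id, 4-byte time, 1-byte rating) and the function names are assumptions made up for this example.

```python
import struct

# Hypothetical fixed layout: id (signed 64-bit), time (signed 32-bit), rating (unsigned 8-bit).
RECORD = struct.Struct(">qiB")

def pack_record(record_id, time_value, rating):
    """Serialize the three fields into a compact byte string."""
    return RECORD.pack(record_id, time_value, rating)

def unpack_record(raw):
    """Inverse of pack_record; the client must know the layout to read the blob."""
    return RECORD.unpack(raw)

blob = pack_record(158786464563, 1430000000, 4)
print(len(blob), unpack_record(blob))
```

The result is compact, but Cassandra only sees opaque bytes: any change to the layout, and any ordering or comparison on the individual fields, becomes the responsibility of the application.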