Bad performance when writing log data to Cassandra with timeuuid as a column name


Question

Following the pointers in an eBay tech blog and a DataStax developers blog, I model some event log data in Cassandra 1.2. As the partition key, I use "ddmmyyhh|bucket", where bucket is any number between 0 and the number of nodes in the cluster.


Data model

cqlsh:Log> CREATE TABLE transactions (yymmddhh varchar, bucket int, rId int, created timeuuid, data map<text, text>, PRIMARY KEY ((yymmddhh, bucket), created));

(rId identifies the resource that fired the event.)
(map holds key-value pairs derived from a JSON document; the keys change, but not much.)

I assume that this translates into a composite primary/row key with X buckets per hour. My column names are then timeuuids. Querying this data model works as expected (I can query time ranges).
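
For illustration, a minimal sketch of what such a time-range query could look like in CQL, using the minTimeuuid()/maxTimeuuid() functions to bound the timeuuid clustering column. The literal partition-key values, map contents, and bucket choice here are assumptions, not taken from the question:

-- Hypothetical insert: the client picks a bucket, e.g. by hashing rId.
INSERT INTO transactions (yymmddhh, bucket, rId, created, data)
VALUES ('13031215', 0, 42, now(), {'user': 'alice', 'action': 'login'});

-- Query a time range within one partition; minTimeuuid()/maxTimeuuid()
-- convert timestamps into timeuuid bounds for the clustering column.
SELECT * FROM transactions
WHERE yymmddhh = '13031215' AND bucket = 0
  AND created >= minTimeuuid('2013-03-12 15:00+0000')
  AND created < maxTimeuuid('2013-03-12 16:00+0000');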

The problem is the performance: the time to insert a new row increases continuously. So I am doing something wrong, but I can't pinpoint the problem.

When I use the timeuuid as a part of the row key, the performance remains stable at a high level, but this prevents me from querying it (a query without the full row key of course throws an error message about "filtering").
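
A sketch of that problematic pattern, assuming a variant of the table with created moved into the partition key (the literal values are placeholders):

-- Assumed variant: PRIMARY KEY ((yymmddhh, bucket, created))
-- Partition-key components only support equality, so a time-range
-- query is rejected with an error message about "filtering":
SELECT * FROM transactions
WHERE yymmddhh = '13031215' AND bucket = 0
  AND created >= minTimeuuid('2013-03-12 15:00+0000');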

Any help? Thanks a lot!

Update: Switching from the map data type to predefined column names alleviates the problem. Insert times now seem to remain below about 0.005 s per insert.
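
A minimal sketch of what such a predefined-columns variant might look like; the table name and the three value columns are assumptions standing in for the mostly-fixed JSON keys:

CREATE TABLE transactions_flat (
  yymmddhh varchar,
  bucket int,
  rId int,
  created timeuuid,
  -- assumed stand-ins for the mostly-fixed JSON keys:
  user varchar,
  action varchar,
  details varchar,
  PRIMARY KEY ((yymmddhh, bucket), created)
);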

The core question remains: how is my usage of the "map" data type inefficient? And what would be an efficient way to handle thousands of inserts with only slight variation in the keys?

The keys I put into the map mostly remain the same. I understood the DataStax documentation (can't post the link due to reputation limitations, sorry, but it's easy to find) to say that each key creates an additional column -- or does it create one new column per "map"? That would be... hard for me to believe.

Answer

I suggest you model your rows a little differently. Collections aren't very good to use in cases where you might end up with too many elements in them. The reason is a limitation in the Cassandra binary protocol, which uses two bytes to represent the number of elements in a collection. This means that if your collection has more than 2^16 elements in it, the size field will overflow, and even though the server sends all of the elements back to the client, the client only sees the N % 2^16 first elements (so if you have 2^16 + 3 elements, it will look to the client as if there are only 3 elements).

If there is no risk of getting that many elements into your collections, you can ignore this advice. I would not expect that using collections gives you worse performance; I'm not really sure how that would happen.

CQL3 collections are basically just a hack on top of the storage model (and I don't mean hack in any negative sense), so you can make a MAP-like row that is not constrained by the above limitation yourself:

CREATE TABLE transactions (
  yymmddhh VARCHAR,
  bucket INT,
  created TIMEUUID,
  rId INT,
  key VARCHAR,
  value VARCHAR,
  PRIMARY KEY ((yymmddhh, bucket), created, rId, key)
)


(Notice that I moved rId and the map key into the primary key. I don't know what rId is, but I assume that this is correct.)

This has two drawbacks over using a MAP: it requires you to reassemble the map when you query the data (you would get back one row per map entry), and it uses a little more space since C* will insert a few extra columns. But the upside is that there is no problem with getting too big collections.
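
For illustration, reading one logical map back is then a single-partition slice restricted by the clustering-column prefix; each returned row carries one key/value pair, and the client folds them back into a map. The literal partition-key values and the timeuuid are placeholders:

-- Fetch all key/value entries for one event; the client reassembles
-- the returned (key, value) rows into a single map.
SELECT key, value FROM transactions
WHERE yymmddhh = '13031215' AND bucket = 0
  AND created = 64d9e790-8b0f-11e2-9e96-0800200c9a66  -- assumed timeuuid
  AND rId = 42;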

In the end it depends a lot on how you want to query your data. Don't optimize for insertions; optimize for reads. For example: if you don't need to read back the whole map every time, but usually just read one or two keys from it, put the key in the partition/row key instead and have a separate partition/row per key (this assumes that the set of keys will be fixed, so you know what to query for; as I said, it depends a lot on how you want to query your data).
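
A sketch of that read-optimized variant, assuming a fixed set of keys; the table name and layout are assumptions:

-- One partition per (hour, bucket, key); reading a single key
-- over a time range touches exactly one partition.
CREATE TABLE transactions_by_key (
  yymmddhh varchar,
  bucket int,
  key varchar,
  created timeuuid,
  rId int,
  value varchar,
  PRIMARY KEY ((yymmddhh, bucket, key), created)
);

SELECT created, rId, value FROM transactions_by_key
WHERE yymmddhh = '13031215' AND bucket = 0 AND key = 'user'
  AND created >= minTimeuuid('2013-03-12 15:00+0000');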

You also mentioned in a comment that the performance improved when you increased the number of buckets from three (0-2) to 300 (0-299). The reason for this is that you spread the load much more evenly throughout the cluster. When you have a partition/row key that is based on time, like your yymmddhh, there will always be a hot partition where all writes go (it moves throughout the day, but at any given moment it will hit only one node). You correctly added a smoothing factor with the bucket column/cell, but with only three values the likelihood of at least two ending up on the same physical node is too high. With three hundred you will get a much better spread.
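
For illustration, the bucket value is computed client-side before each insert; one assumed scheme is bucket = rId % 300 (any stable hash modulo the bucket count would do), so writes in the same hour fan out across many partitions:

-- Client computes the bucket before issuing the statement,
-- e.g. bucket = rId % 300; here rId = 42 gives bucket 42.
INSERT INTO transactions (yymmddhh, bucket, created, rId, key, value)
VALUES ('13031215', 42, now(), 42, 'user', 'alice');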

