Fetching all the records for a partitionID in Cassandra gives RPC timeout


Problem description

I am using Cassandra 1.2.1 with a composite primary key, and I am trying to fetch all the records for a particular partitionID. This is the schema I'm using:

  • TimeStamp
  • Device ID
  • Data Transfer
  • Location ID
  • Device Owner

The primary key is a composite key: (TimeStamp, Device ID). Therefore TimeStamp is the partition key. Each record will be 70-80 bytes.

There are 1000 different timestamps, and for each timestamp there are 500K device IDs. So there are 500 million records, and I want to fetch all the records for a particular timestamp. Something similar to:

Select * from schema where TimeStamp = '..'

My understanding is that this query should be able to fetch all the records fast, since the relevant rows are all stored in contiguous disk locations, which means very few disk seeks will give us the result. The filter is on TimeStamp, which means just one node will be hit with the query. Also, the total amount of data is 500K * 80 bytes ~ 40 MB, which is not an awful lot. However, I'm getting RPC timeouts when I run this with CQL (3) or Astyanax.

Is my understanding that all the records for a partitionID are in contiguous disk locations wrong? What is the correct way to bulk fetch such data?
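The size estimate in the question can be checked with a few lines of arithmetic. This is only a sketch of the numbers quoted above; the variable names are illustrative, and 80 bytes is the upper end of the stated 70-80 byte record size.

```python
# Figures taken from the question; names are illustrative only.
timestamps = 1000                 # distinct partition keys
devices_per_timestamp = 500_000   # rows per partition
bytes_per_record = 80             # upper end of the 70-80 byte estimate

total_records = timestamps * devices_per_timestamp
partition_mb = devices_per_timestamp * bytes_per_record / 1_000_000

print(total_records)  # 500000000 records overall
print(partition_mb)   # 40.0 MB of raw record data in one partition
```

Note that this counts raw record bytes only; it says nothing about on-disk or on-wire overhead.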

Solution

Eventually the columns will be close to each other on disk, because they are in the same row. But until compaction has completed (i.e. assuming you don't run nodetool compact), they won't be. Even so, they should only be split across a few SSTables.

However, the slower part is probably the CPU work: deserializing the data, comparing the results from the other replicas, and serializing the response back to the client. I doubt you can do that for 500k objects within rpc_timeout (default is 10 seconds).

To fetch the whole partition, you should page through the result.
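To put the timeout in perspective, the per-record time budget can be worked out from the two numbers mentioned above. This is a rough back-of-the-envelope sketch, assuming the 10 second timeout and the 500k records per partition; it is not a measurement of what Cassandra actually spends per record.

```python
# Figures from the question and answer; names are illustrative only.
rpc_timeout_s = 10    # timeout quoted in the answer
records = 500_000     # rows in one partition

# Time available per record if the whole partition must be read,
# compared across replicas, and serialized within a single timeout.
budget_us = rpc_timeout_s * 1_000_000 / records

print(budget_us)  # 20.0 microseconds per record
```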

For your first query, do

SELECT * from schema where TimeStamp = '..' limit 1000

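Then take the last device ID in that page, call it last, and ask for the rows that come after it (DeviceID here stands for the device ID column):

SELECT * from schema where TimeStamp = '..' and DeviceID > 'last' limit 1000

Repeat until you get fewer than 1000 rows in the response.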
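The paging loop described above might look roughly like the following sketch. It is deliberately driver-agnostic: `fetch_page` is a hypothetical callable that would run the LIMIT query for one page (with any real client library), and here it is replaced by an in-memory stand-in so the loop can be run on its own. The row shape `(device_id, payload)` is also an assumption made for illustration.

```python
from typing import Callable, List, Optional, Tuple

Row = Tuple[str, str]  # (device_id, payload); hypothetical row shape


def fetch_partition(fetch_page: Callable[[Optional[str], int], List[Row]],
                    page_size: int = 1000) -> List[Row]:
    """Collect all rows of one partition by paging on the device ID."""
    rows: List[Row] = []
    last: Optional[str] = None  # last device ID seen; None means first page
    while True:
        page = fetch_page(last, page_size)
        rows.extend(page)
        if len(page) < page_size:
            return rows          # a short page means the partition is exhausted
        last = page[-1][0]       # resume after the last device ID of this page


def make_fake_fetch(data: List[Row]) -> Callable[[Optional[str], int], List[Row]]:
    """In-memory stand-in for the real query, ordered by device ID."""
    ordered = sorted(data)

    def fetch_page(last: Optional[str], limit: int) -> List[Row]:
        remaining = [r for r in ordered if last is None or r[0] > last]
        return remaining[:limit]

    return fetch_page


data = [(f"dev{i:05d}", f"payload{i}") for i in range(2500)]
result = fetch_partition(make_fake_fetch(data), page_size=1000)
print(len(result))  # 2500
```

With a real cluster, each call to `fetch_page` would be a separate request, so each one gets its own timeout instead of the whole partition having to fit into a single one.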
