How to iterate over large Cassandra table in small chunks in Spark

Views: 212 Published: 2016/5/22 16:28:46 Tags: scala cassandra apache-spark rdd

Problem description

In my test environment I have 1 Cassandra node and 3 Spark nodes. I want to iterate over an apparently large table that has about 200k rows, each taking roughly 20-50 KB.

CREATE TABLE foo (
  uid timeuuid,
  events blob,
  PRIMARY KEY ((uid))
)

Here is the Scala code that is executed on the Spark cluster:

import com.datastax.spark.connector._  // adds cassandraTable to SparkContext

val rdd = sc.cassandraTable("test", "foo")

// This pulls records in memory, taking ~6.3GB
val count = rdd.select("events").count()

// Fails nearly immediately with
// NoHostAvailableException: All host(s) tried for query failed [...]
val events = rdd.select("events").collect()

Versions: Cassandra 2.0.9, Spark 1.2.1, spark-cassandra-connector 1.2.0-alpha2

I also tried running only collect, without count. In that case it just fails fast with NoHostAvailableException.

Question: what is the correct approach to iterate over a large table, reading and processing a small batch of rows at a time?
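
For context, about 200k rows of 20-50 KB each comes to roughly 4-10 GB, so collect() would have to materialize the whole table on the driver. One possible direction, shown here as a minimal sketch rather than a confirmed fix, is to keep the work on the executors and consume each partition's iterator a few rows at a time. The helper name processInChunks, the chunk size of 100, and the fake in-memory partition are all hypothetical; on the cluster the iterator would come from rdd.foreachPartition. The connector also has read-tuning properties for split and page size (in the 1.x line these were, as far as I recall, spark.cassandra.input.split.size and spark.cassandra.input.page.row.size; treat those names as an assumption and check the documentation for your version).

```scala
object ChunkedIterationSketch {

  // Hypothetical helper, not part of Spark or the connector: walks an
  // iterator lazily and hands at most `chunkSize` rows to `handle` at a time.
  // Returns the number of rows processed.
  def processInChunks[A](rows: Iterator[A], chunkSize: Int)(handle: Seq[A] => Unit): Long = {
    var processed = 0L
    rows.grouped(chunkSize).foreach { chunk =>
      handle(chunk)            // only the current chunk is held here
      processed += chunk.size
    }
    processed
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for one Spark partition of the `events` column. On the cluster
    // the same helper could be called from inside foreachPartition, e.g.
    //   rdd.select("events").foreachPartition { rows =>
    //     ChunkedIterationSketch.processInChunks(rows, 100)(chunk => ...)
    //   }
    val fakePartition: Iterator[Array[Byte]] = Iterator.fill(250)(new Array[Byte](16))

    val total = processInChunks(fakePartition, 100) { chunk =>
      println(s"processing ${chunk.size} rows")
    }
    println(s"processed $total rows")
  }
}
```

Whether this avoids the NoHostAvailableException depends on why the single Cassandra node stops answering. If the cause is oversized result pages, lowering the connector's page size is likely the more relevant knob than the chunking on the Spark side.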
