How to iterate over a large Cassandra table in small chunks in Spark

Question

In my test environment I have 1 Cassandra node and 3 Spark nodes. I want to iterate over an apparently large table that has about 200k rows, each taking roughly 20-50 KB.

CREATE TABLE foo (
  uid timeuuid,
  events blob,
  PRIMARY KEY ((uid))
) 

Here is the Scala code that is executed on the Spark cluster:

import com.datastax.spark.connector._  // adds cassandraTable to SparkContext

val rdd = sc.cassandraTable("test", "foo")

// This pulls records in memory, taking ~6.3GB
val count = rdd.select("events").count()

// Fails nearly immediately with
// NoHostAvailableException: All host(s) tried for query failed [...]
val events = rdd.select("events").collect()

Versions: Cassandra 2.0.9, Spark 1.2.1, spark-cassandra-connector 1.2.0-alpha2

I also tried running only collect, without count; in that case it simply fails fast with NoHostAvailableException.

Question: what is the correct approach to iterating over a large table, reading and processing a small batch of rows at a time?

Recommended answer

There are two settings in the Spark Cassandra Connector that adjust the chunk size (set them on the SparkConf object):


  • spark.cassandra.input.split.size: number of rows per Spark partition (default 100000)

  • spark.cassandra.input.page.row.size: number of rows per fetched page, i.e. per network round trip (default 1000)
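
As a rough illustration of why the defaults hurt here: at 20-50 KB per row, 100000 rows per partition works out to roughly 2-5 GB per partition. The sketch below shows one way the two settings might be lowered on the SparkConf. The values 10000 and 100, the application name, and the connection host are assumptions for illustration only, not tuned recommendations, and the master URL is assumed to come from spark-submit.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object ChunkedRead {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("chunked-cassandra-read")                 // hypothetical name
      .set("spark.cassandra.connection.host", "127.0.0.1")  // assumption: local Cassandra node
      // Fewer rows per Spark partition, so each task holds less data (illustrative value).
      .set("spark.cassandra.input.split.size", "10000")
      // Fewer rows per fetched page, so each network round trip is smaller (illustrative value).
      .set("spark.cassandra.input.page.row.size", "100")

    // The master URL is assumed to be supplied by spark-submit.
    val sc = new SparkContext(conf)

    val rdd = sc.cassandraTable("test", "foo").select("events")
    // A smaller split size should show up as a larger number of partitions.
    println(s"partitions: ${rdd.partitions.length}")

    sc.stop()
  }
}
```

The settings have to be in place before the SparkContext is created, since the connector reads them from the context's configuration.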

Furthermore, you shouldn't use the collect action in your example, because it fetches all the rows into the driver application's memory and may raise an out-of-memory exception. Use collect only if you know for sure it will produce a small number of rows. The count action is different: it produces only an integer. So I advise you to load your data from Cassandra as you did, process it, and store the result (in Cassandra, HDFS, or elsewhere).
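
One possible shape for "process it on the cluster instead of collecting" is to walk each partition's iterator in fixed-size batches. The helper below is a hypothetical sketch (the name processInBatches and the batch size are not part of Spark or the connector). It is plain Scala, so it can be tried without a cluster; the comment at the end shows how it might be plugged into foreachPartition.

```scala
// Hypothetical helper: consumes one partition's rows in fixed-size batches,
// so only `batchSize` rows are materialized at a time.
object BatchProcessing {
  def processInBatches[A](rows: Iterator[A], batchSize: Int)(handle: Seq[A] => Unit): Long = {
    require(batchSize > 0, "batchSize must be positive")
    var processed = 0L
    // grouped pulls at most batchSize elements from the underlying iterator per step.
    rows.grouped(batchSize).foreach { batch =>
      handle(batch)
      processed += batch.size
    }
    processed
  }
}

// Intended use on the executors (assumes Spark and the connector are on the classpath):
//   sc.cassandraTable("test", "foo").select("events").foreachPartition { rows =>
//     BatchProcessing.processInBatches(rows, 500) { batch =>
//       // process the batch and write the result to external storage here
//     }
//   }
```

Because the work runs inside foreachPartition, the rows stay on the executors and never pass through the driver, which is the point of avoiding collect.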
