Apache Spark fails to process a large Cassandra column family


Problem Description

I am trying to use Apache Spark to process my large (~230k entries) Cassandra dataset, but I constantly run into different kinds of errors. However, I can successfully run applications on a dataset of ~200 entries. I have a Spark setup of 3 nodes with 1 master and 2 workers, and the 2 workers also host a Cassandra cluster with the data indexed with a replication factor of 2. My 2 Spark workers show 2.4 and 2.8 GB of memory on the web interface, and I set spark.executor.memory to 2409 when running an application, to get a combined memory of 4.7 GB. Here is my WebUI homepage:

[Screenshot: Spark WebUI homepage]

The Environment page of one of the tasks:

[Screenshot: Environment page]

At this stage, I am simply trying to process the data stored in Cassandra using Spark. Here is the basic code I am using to do this in Java:

// javaFunctions(...) comes from the connector's Java API:
// import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

// Point the connector at the Cassandra cluster and ship the application jars to the workers.
SparkConf conf = new SparkConf(true)
        .set("spark.cassandra.connection.host", CASSANDRA_HOST)
        .setJars(jars);

SparkContext sc = new SparkContext(HOST, APP_NAME, conf);
SparkContextJavaFunctions context = javaFunctions(sc);

// Build an RDD over the whole column family; count() triggers a full scan of the table.
CassandraJavaRDD<CassandraRow> rdd = context.cassandraTable(CASSANDRA_KEYSPACE, CASSANDRA_COLUMN_FAMILY);

System.out.println(rdd.count());

For a successful run on a small dataset (200 entries), the events interface looks something like this:

[Screenshot: event timeline for the successful run]

But when I run the same thing on the large dataset (i.e. I change only CASSANDRA_COLUMN_FAMILY), the job never terminates in the terminal, and the log looks like this:

[Screenshot: driver log output]

and after ~2 minutes, the stderr for the executors looks like this:

[Screenshot: executor stderr]

and after ~7 minutes, I get

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

in my terminal, and I have to manually kill the SparkSubmit process. However, the large dataset was indexed from a binary file that occupies only 22 MB, and with nodetool status I can see that only ~115 MB of data is stored across my two Cassandra nodes. I have also tried to use Spark SQL on my dataset, but got similar results with that too. Where am I going wrong with my setup, and what should I do to successfully process my dataset, both for a Transformation-Action program and for a program that uses Spark SQL?
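The question does not show the Spark SQL variant; for reference, one way it could look is the sketch below. This is an assumption, not the asker's code: it assumes the CassandraSQLContext that ships with spark-cassandra-connector 1.4.x and reuses the SparkContext and constants from the snippet above.

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.cassandra.CassandraSQLContext;

// Assumed sketch: run the same count through Spark SQL instead of the RDD API.
CassandraSQLContext cassandraSql = new CassandraSQLContext(sc);
cassandraSql.setKeyspace(CASSANDRA_KEYSPACE);
DataFrame result = cassandraSql.sql("SELECT COUNT(*) FROM " + CASSANDRA_COLUMN_FAMILY);
result.show();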

I have tried the following methods:


  • Using -Xms1G -Xmx1G to increase memory, but the program fails with an exception saying that I should instead set spark.executor.memory, which I have.

  • Using spark.cassandra.input.split.size, which fails saying it isn't a valid option; a similar option is spark.cassandra.input.split.size_in_mb, which I set to 1, with no effect (see the sketch after this list).
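For reference, a minimal sketch of how such values can be set on the SparkConf from the snippet above (they can equally be passed as --conf flags to spark-submit); the memory and split-size values are the ones mentioned in the question, not a recommendation.

SparkConf conf = new SparkConf(true)
        .set("spark.cassandra.connection.host", CASSANDRA_HOST)
        // Executor heap size; Spark asks for this instead of raw -Xms/-Xmx JVM flags.
        .set("spark.executor.memory", "2409m")
        // Option name accepted by connector 1.4; plain spark.cassandra.input.split.size was rejected as invalid.
        .set("spark.cassandra.input.split.size_in_mb", "1")
        .setJars(jars);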

Edit

Based on this answer, I have also tried the following methods:


  • Setting spark.storage.memoryFraction to 0

  • Not setting spark.storage.memoryFraction to zero, and using persist with MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK and MEMORY_AND_DISK_SER (see the sketch after this list).
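For reference, a minimal sketch of the persist variant with the Java API used above; MEMORY_AND_DISK_SER is just one of the storage levels listed, and this is not the asker's exact code.

import org.apache.spark.storage.StorageLevel;

// Cache the scanned rows in serialized form and spill to disk when the heap fills,
// instead of keeping deserialized objects in memory.
CassandraJavaRDD<CassandraRow> rdd =
        context.cassandraTable(CASSANDRA_KEYSPACE, CASSANDRA_COLUMN_FAMILY);
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER());

System.out.println(rdd.count());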

Versions:


  • Spark: 1.4.0

  • Cassandra: 2.1.6

  • spark-cassandra-connector: 1.4.0-M1

Recommended Answer

I think there is an issue in the latest spark-cassandra-connector. The parameter spark.cassandra.input.split.size_in_mb is supposed to have a default value of 64 MB, but it is being interpreted as 64 bytes in the code. This causes far too many partitions to be created, which cannot be scheduled by Spark. Try setting the conf value to

spark.cassandra.input.split.size_in_mb=67108864
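If the value really is read as raw bytes, the workaround can also be applied on the SparkConf from the question; a sketch, where 67108864 is simply 64 * 1024 * 1024.

SparkConf conf = new SparkConf(true)
        .set("spark.cassandra.connection.host", CASSANDRA_HOST)
        // Work around the suspected unit bug: pass 64 MB expressed in bytes.
        .set("spark.cassandra.input.split.size_in_mb", "67108864")
        .setJars(jars);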
