Apache Spark fails to process a large Cassandra column family


Problem Description

I am trying to use Apache Spark to process my large (~230k entries) Cassandra dataset, but I constantly run into different kinds of errors. However, I can successfully run applications on a dataset of ~200 entries. I have a Spark setup of 3 nodes with 1 master and 2 workers, and the 2 workers also host a Cassandra cluster with the data indexed with a replication factor of 2. My 2 Spark workers show 2.4 and 2.8 GB of memory on the web interface, and I set spark.executor.memory to 2409 when running an application, to get a combined memory of 4.7 GB. Here is my WebUI homepage:

[Screenshot: Spark WebUI homepage]

The Environment page of one of the tasks:

[Screenshot: Environment page]

At this stage, I am simply trying to process the data stored in Cassandra using Spark. Here is the basic code I am using to do this in Java:

// javaFunctions(...) comes from the connector's Java API:
// import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

// Point the connector at the Cassandra cluster and ship the application jars to the workers.
SparkConf conf = new SparkConf(true)
        .set("spark.cassandra.connection.host", CASSANDRA_HOST)
        .setJars(jars);

SparkContext sc = new SparkContext(HOST, APP_NAME, conf);
SparkContextJavaFunctions context = javaFunctions(sc);

// Build an RDD over the whole column family; count() triggers a full scan of the table.
CassandraJavaRDD<CassandraRow> rdd = context.cassandraTable(CASSANDRA_KEYSPACE, CASSANDRA_COLUMN_FAMILY);

System.out.println(rdd.count());

For a successful run on a small dataset (200 entries), the events interface looks something like this:

[Screenshot: event timeline for the successful run]

But when I run the same thing on the large dataset (i.e. I change only CASSANDRA_COLUMN_FAMILY), the job never terminates in the terminal, and the log looks like this:

[Screenshot: driver log output]

and after ~2 minutes, the stderr for the executors looks like this:

[Screenshot: executor stderr]

and after ~7 minutes, I get

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

in my terminal, and I have to manually kill the SparkSubmit process. However, the large dataset was indexed from a binary file that occupies only 22 MB, and with nodetool status I can see that only ~115 MB of data is stored across my two Cassandra nodes. I have also tried to use Spark SQL on my dataset, but got similar results with that too. Where am I going wrong with my setup, and what should I do to successfully process my dataset, both for a Transformation-Action program and for a program that uses Spark SQL?
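The question does not show the Spark SQL variant; for reference, one way it could look is the sketch below. This is an assumption, not the asker's code: it assumes the CassandraSQLContext that ships with spark-cassandra-connector 1.4.x and reuses the SparkContext and constants from the snippet above.

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.cassandra.CassandraSQLContext;

// Assumed sketch: run the same count through Spark SQL instead of the RDD API.
CassandraSQLContext cassandraSql = new CassandraSQLContext(sc);
cassandraSql.setKeyspace(CASSANDRA_KEYSPACE);
DataFrame result = cassandraSql.sql("SELECT COUNT(*) FROM " + CASSANDRA_COLUMN_FAMILY);
result.show();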

I have tried the following methods:


  • Using -Xms1G -Xmx1G to increase memory, but the program fails with an exception saying that I should instead set spark.executor.memory, which I have.

  • Using spark.cassandra.input.split.size, which fails saying it isn't a valid option; a similar option is spark.cassandra.input.split.size_in_mb, which I set to 1, with no effect (see the sketch after this list).
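For reference, a minimal sketch of how such values can be set on the SparkConf from the snippet above (they can equally be passed as --conf flags to spark-submit); the memory and split-size values are the ones mentioned in the question, not a recommendation.

SparkConf conf = new SparkConf(true)
        .set("spark.cassandra.connection.host", CASSANDRA_HOST)
        // Executor heap size; Spark asks for this instead of raw -Xms/-Xmx JVM flags.
        .set("spark.executor.memory", "2409m")
        // Option name accepted by connector 1.4; plain spark.cassandra.input.split.size was rejected as invalid.
        .set("spark.cassandra.input.split.size_in_mb", "1")
        .setJars(jars);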

Edit

Based on this answer, I have also tried the following methods:


  • Setting spark.storage.memoryFraction to 0

  • Not setting spark.storage.memoryFraction to zero, and using persist with MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK and MEMORY_AND_DISK_SER (see the sketch after this list).
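For reference, a minimal sketch of the persist variant with the Java API used above; MEMORY_AND_DISK_SER is just one of the storage levels listed, and this is not the asker's exact code.

import org.apache.spark.storage.StorageLevel;

// Cache the scanned rows in serialized form and spill to disk when the heap fills,
// instead of keeping deserialized objects in memory.
CassandraJavaRDD<CassandraRow> rdd =
        context.cassandraTable(CASSANDRA_KEYSPACE, CASSANDRA_COLUMN_FAMILY);
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER());

System.out.println(rdd.count());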

Versions:


  • Spark: 1.4.0

  • Cassandra: 2.1.6

  • spark-cassandra-connector: 1.4.0-M1

Recommended Answer

I think there is an issue in the latest spark-cassandra-connector. The parameter spark.cassandra.input.split.size_in_mb is supposed to have a default value of 64 MB, but it is being interpreted as 64 bytes in the code. This causes far too many partitions to be created, which cannot be scheduled by Spark. Try setting the conf value to

spark.cassandra.input.split.size_in_mb=67108864
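If the value really is read as raw bytes, the workaround can also be applied on the SparkConf from the question; a sketch, where 67108864 is simply 64 * 1024 * 1024.

SparkConf conf = new SparkConf(true)
        .set("spark.cassandra.connection.host", CASSANDRA_HOST)
        // Work around the suspected unit bug: pass 64 MB expressed in bytes.
        .set("spark.cassandra.input.split.size_in_mb", "67108864")
        .setJars(jars);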
