Spark job always failing for joins (CDH 5.5.2, Spark 1.5.0)


Problem Description

We are running into frequent errors with the Spark standalone cluster on our newly installed CDH 5.5.2 cluster. We have 7 worker nodes, each with 16 GB of memory, but almost all joins are failing.

I have made sure I allocated the full memory with --executor-memory, and verified in the Spark UI that that much memory was actually allocated.
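For illustration, a spark-submit invocation of this shape is how the executor memory gets set; every name and path below is a placeholder, and the 12G figure is only an example sized against the 16 GB nodes:

spark-submit \
  --master spark://master-host:7077 \
  --executor-memory 12G \
  --class com.example.JoinJob \
  /path/to/join-job.jar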

Most of our errors are as below. We have checked things on our side, but none of our fixes worked.

Caused by: java.io.FileNotFoundException: /tmp/spark-b9e69c8d-153b-4c4b-88b1-ac779c060de5/executor-44e88b75-5e79-4d96-b507-ddabcab30e1b/blockmgr-cd27625c-8716-49ac-903d-9d5c36cf2622/29/shuffle_1_66_0.index (Permission denied)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getSortBasedShuffleBlockData(ExternalShuffleBlockResolver.java:275)
... 27 more
at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:162)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:103)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more

  1. /tmp has 777 permissions, but it still reports that /tmp has no permission.

  2. We have configured SPARK_LOCAL_DIRS to another folder where we have more disk space, but the cluster is still using /tmp. Why? We changed it through Cloudera Manager, and printing spark.local.dirs from the Spark configuration shows the folder we set. But at execution time it is the other way around: Spark is looking for the files in /tmp. Are we missing anything here? (See the configuration sketch after this list.)

  3. We have turned off Spark on YARN; could any YARN configuration be affecting the standalone deployment?
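For reference, this is roughly how SPARK_LOCAL_DIRS is applied on a standalone cluster; the directory path below is only an example, and on a CDH cluster the restart would normally be done through Cloudera Manager rather than the bundled scripts:

# conf/spark-env.sh on every worker node (example path, not the actual mount)
export SPARK_LOCAL_DIRS=/data/spark/local

# the workers must be restarted before the new scratch directory is used;
# with plain Spark the bundled scripts would be:
sbin/stop-slaves.sh
sbin/start-slaves.sh

# afterwards the spark-*/blockmgr-* scratch directories should appear under the new path
ls -d /data/spark/local/spark-*

The key detail is that SPARK_LOCAL_DIRS is read from the worker process's own environment and overrides spark.local.dir, so setting the property only on the application or driver side still leaves the executors writing their shuffle files under the default /tmp.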

Has anyone faced this issue? And why does it keep recurring for us? We had a similar cluster with Hortonworks, where we installed bare-bones Spark (not part of the distribution), and it worked very well. But in our new cluster we are facing this issue. Maybe we missed something, but we are curious to know what.

Recommended Answer

This works for me.

On all the nodes:

sudo chmod -R 0777 /tmp
sudo chmod +t /tmp
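chmod -R 0777 opens /tmp to every user, and chmod +t restores the sticky bit so that users can still only delete their own files; mode 1777 with the sticky bit is the normal permission for /tmp. A quick check on a node:

ls -ld /tmp
# expected mode: drwxrwxrwt   (the trailing 't' is the sticky bit)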

Using parallel-ssh:

sudo parallel-ssh -h hosts.txt -l ubuntu --timeout=0 'sudo chmod -R 0777 /tmp'

sudo parallel-ssh -h hosts.txt -l ubuntu --timeout=0 'sudo chmod +t /tmp'
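Assuming the same hosts.txt, the result can then be checked across all nodes in one pass; parallel-ssh's -i flag prints each host's output inline:

parallel-ssh -h hosts.txt -l ubuntu -i 'ls -ld /tmp'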

