How to fix "Error opening block StreamChunkId" on external spark shuffle service


Question

I'm trying to run Spark jobs from my Zeppelin deployment in a Kubernetes cluster. I also have a Spark shuffle service (a DaemonSet, v2.2.0-k8s) running in a different namespace. Here are my Spark configs (set on the Zeppelin pod):

--conf spark.kubernetes.executor.docker.image=<spark-executor> 
--conf spark.executor.cores=5
--conf spark.driver.memory=5g
--conf spark.executor.memory=5g
--conf spark.kubernetes.authenticate.driver.serviceAccountName=<svc-account> 
--conf spark.local.dir=/tmp/spark-local 
--conf spark.executor.instances=5 
--conf spark.dynamicAllocation.enabled=true 
--conf spark.shuffle.service.enabled=true 
--conf spark.kubernetes.shuffle.labels="app=spark-shuffle,spark-version=2.2.0" 
--conf spark.dynamicAllocation.maxExecutors=5   
--conf spark.dynamicAllocation.minExecutors=1 
--conf spark.kubernetes.shuffle.namespace=<namespace> 
--conf spark.kubernetes.docker.image.pullPolicy=IfNotPresent 
--conf spark.kubernetes.initcontainer.docker.image=kubespark/spark-init:v2.2.0-kubernetes-0.5.0 
--conf spark.kubernetes.resourceStagingServer.uri=<ip:port>

But I get the following logs from the external spark-shuffle service and from the Spark executors spawned by Zeppelin:

+ /sbin/tini -s -- /opt/spark/bin/spark-class org.apache.spark.deploy.k8s.KubernetesExternalShuffleService 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/spark/jars/kubernetes-client-3.0.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2020-01-14 03:37:31 INFO  ExternalShuffleService:2574 - Started daemon with process name: 10@unawa2-shuffle-unawa2-spark-shuffle-d5cfg
2020-01-14 03:37:31 INFO  SignalUtils:54 - Registered signal handler for TERM
2020-01-14 03:37:31 INFO  SignalUtils:54 - Registered signal handler for HUP
2020-01-14 03:37:31 INFO  SignalUtils:54 - Registered signal handler for INT
2020-01-14 03:37:31 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020-01-14 03:37:31 INFO  SecurityManager:54 - Changing view acls to: root
2020-01-14 03:37:31 INFO  SecurityManager:54 - Changing modify acls to: root
2020-01-14 03:37:31 INFO  SecurityManager:54 - Changing view acls groups to:
2020-01-14 03:37:31 INFO  SecurityManager:54 - Changing modify acls groups to:
2020-01-14 03:37:31 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2020-01-14 03:37:32 INFO  KubernetesExternalShuffleService:54 - Starting shuffle service on port 7337 (auth enabled = false)
2020-01-14 03:38:35 INFO  KubernetesShuffleBlockHandler:54 - Received registration request from app spark-application-1578973110574 (remote address /192.168.2.37:40318).
2020-01-14 03:38:36 INFO  ExternalShuffleBlockResolver:135 - Registered executor AppExecId{appId=spark-application-1578973110574, execId=5} with ExecutorShuffleInfo{localDirs=[/tmp/spark-local/blockmgr-8a26a714-3ecb-46dd-8499-ff796fa97744], subDirsPerLocalDir=64, shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
2020-01-14 03:39:15 ERROR TransportRequestHandler:127 - Error opening block StreamChunkId{streamId=527834012000, chunkIndex=0} for request from /192.168.3.130:50896
java.lang.RuntimeException: Failed to open file: /tmp/spark-local/blockmgr-8a26a714-3ecb-46dd-8499-ff796fa97744/0f/shuffle_1_0_0.index
        at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getSortBasedShuffleBlockData(ExternalShuffleBlockResolver.java:249)
        at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:174)
        at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler$1.next(ExternalShuffleBlockHandler.java:105)
        at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler$1.next(ExternalShuffleBlockHandler.java:95)
        at org.apache.spark.network.server.OneForOneStreamManager.getChunk(OneForOneStreamManager.java:89)
        at org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:125)
        at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:103)
        at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
.
.
.
Caused by: java.util.concurrent.ExecutionException: java.io.FileNotFoundException: /tmp/spark-local/blockmgr-8a26a714-3ecb-46dd-8499-ff796fa97744/0f/shuffle_1_0_0.index (No such file or directory)
        at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
        at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
        at org.spark_project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
        at org.spark_project.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)

Any idea how to fix this?

I mounted the local dir /tmp/spark-local into my pods. When I SSHed into each node, I confirmed that the block manager directory exists on one of the worker nodes (I'm guessing this is the expected behavior). The error occurs when one of the shuffle pods on another worker node tries to access the same block manager.
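
One way to verify this is to list the shuffle pods with their nodes and compare what each one actually sees under /tmp/spark-local. A quick check using the label selector from the configs above (the pod name is a placeholder):

kubectl -n <namespace> get pods -l app=spark-shuffle,spark-version=2.2.0 -o wide
# run against the shuffle pod on the node where the executor ran, then against one on another node:
kubectl -n <namespace> exec <shuffle-pod> -- ls /tmp/spark-local
# the blockmgr-* directory only exists on the node where the executor wrote it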

Answer

A summary from the comment thread.

In order to run Spark on Kubernetes with dynamic allocation enabled you can:

  • Run the external shuffle service as a DaemonSet, so that every node that can host executors also hosts a shuffle pod.
  • Mount spark.local.dir (/tmp/spark-local here) as a hostPath volume in both the shuffle pods and the executor pods, so that the shuffle service reads shuffle files from the same node-local directory the executors write them to (see the sketch below).
  • Point spark.kubernetes.shuffle.labels and spark.kubernetes.shuffle.namespace at the labels and namespace of that DaemonSet.
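
A minimal sketch of such a DaemonSet, assuming a kubespark shuffle image (the image name and tag are assumptions patterned after the spark-init image in the question) and the /tmp/spark-local path from the question's configs:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: spark-shuffle
  namespace: <namespace>
spec:
  selector:
    matchLabels:
      app: spark-shuffle
      spark-version: "2.2.0"
  template:
    metadata:
      labels:
        # must match spark.kubernetes.shuffle.labels
        app: spark-shuffle
        spark-version: "2.2.0"
    spec:
      containers:
      - name: shuffle
        # assumed image; use whatever shuffle-service image your cluster actually runs
        image: kubespark/spark-shuffle:v2.2.0-kubernetes-0.5.0
        volumeMounts:
        - name: spark-local
          # must equal spark.local.dir on the executors
          mountPath: /tmp/spark-local
      volumes:
      - name: spark-local
        hostPath:
          # node-local dir shared with executor pods scheduled on the same node
          path: /tmp/spark-local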

Important notes:

  • You should base your images on the kubespark images, which are built from the forked Apache Spark 2.2.0 (the apache-spark-on-k8s project).
  • The feature is experimental and not officially supported.
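
For example, the image configs would then point at the kubespark builds; the exact tags below are an assumption patterned after the spark-init image in the question:

--conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0
--conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.5.0
--conf spark.kubernetes.initcontainer.docker.image=kubespark/spark-init:v2.2.0-kubernetes-0.5.0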
