How to fix "Error opening block StreamChunkId" on external Spark shuffle service
Problem description
I'm trying to run Spark jobs from my Zeppelin deployment in a Kubernetes cluster. I also have a Spark shuffle service (DaemonSet, v2.2.0-k8s) running in a different namespace. Here are my Spark configs (set on the Zeppelin pod):
--conf spark.kubernetes.executor.docker.image=<spark-executor>
--conf spark.executor.cores=5
--conf spark.driver.memory=5g
--conf spark.executor.memory=5g
--conf spark.kubernetes.authenticate.driver.serviceAccountName=<svc-account>
--conf spark.local.dir=/tmp/spark-local
--conf spark.executor.instances=5
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.kubernetes.shuffle.labels="app=spark-shuffle,spark-version=2.2.0"
--conf spark.dynamicAllocation.maxExecutors=5
--conf spark.dynamicAllocation.minExecutors=1
--conf spark.kubernetes.shuffle.namespace=<namespace>
--conf spark.kubernetes.docker.image.pullPolicy=IfNotPresent
--conf spark.kubernetes.initcontainer.docker.image=kubespark/spark-init:v2.2.0-kubernetes-0.5.0
--conf spark.kubernetes.resourceStagingServer.uri=<ip:port>
However, I get the following logs from the external Spark shuffle service and the Spark executors spawned by Zeppelin:
+ /sbin/tini -s -- /opt/spark/bin/spark-class org.apache.spark.deploy.k8s.KubernetesExternalShuffleService 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/spark/jars/kubernetes-client-3.0.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2020-01-14 03:37:31 INFO ExternalShuffleService:2574 - Started daemon with process name: 10@unawa2-shuffle-unawa2-spark-shuffle-d5cfg
2020-01-14 03:37:31 INFO SignalUtils:54 - Registered signal handler for TERM
2020-01-14 03:37:31 INFO SignalUtils:54 - Registered signal handler for HUP
2020-01-14 03:37:31 INFO SignalUtils:54 - Registered signal handler for INT
2020-01-14 03:37:31 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020-01-14 03:37:31 INFO SecurityManager:54 - Changing view acls to: root
2020-01-14 03:37:31 INFO SecurityManager:54 - Changing modify acls to: root
2020-01-14 03:37:31 INFO SecurityManager:54 - Changing view acls groups to:
2020-01-14 03:37:31 INFO SecurityManager:54 - Changing modify acls groups to:
2020-01-14 03:37:31 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2020-01-14 03:37:32 INFO KubernetesExternalShuffleService:54 - Starting shuffle service on port 7337 (auth enabled = false)
2020-01-14 03:38:35 INFO KubernetesShuffleBlockHandler:54 - Received registration request from app spark-application-1578973110574 (remote address /192.168.2.37:40318).
2020-01-14 03:38:36 INFO ExternalShuffleBlockResolver:135 - Registered executor AppExecId{appId=spark-application-1578973110574, execId=5} with ExecutorShuffleInfo{localDirs=[/tmp/spark-local/blockmgr-8a26a714-3ecb-46dd-8499-ff796fa97744], subDirsPerLocalDir=64, shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
2020-01-14 03:39:15 ERROR TransportRequestHandler:127 - Error opening block StreamChunkId{streamId=527834012000, chunkIndex=0} for request from /192.168.3.130:50896
java.lang.RuntimeException: Failed to open file: /tmp/spark-local/blockmgr-8a26a714-3ecb-46dd-8499-ff796fa97744/0f/shuffle_1_0_0.index
at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getSortBasedShuffleBlockData(ExternalShuffleBlockResolver.java:249)
at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:174)
at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler$1.next(ExternalShuffleBlockHandler.java:105)
at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler$1.next(ExternalShuffleBlockHandler.java:95)
at org.apache.spark.network.server.OneForOneStreamManager.getChunk(OneForOneStreamManager.java:89)
at org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:125)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:103)
at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
.
.
.
Caused by: java.util.concurrent.ExecutionException: java.io.FileNotFoundException: /tmp/spark-local/blockmgr-8a26a714-3ecb-46dd-8499-ff796fa97744/0f/shuffle_1_0_0.index (No such file or directory)
at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at org.spark_project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at org.spark_project.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
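When the shuffle service log is long, it can help to pull out just the files it failed to open. The following is a minimal Python sketch, not part of the original question: the function name `missing_shuffle_files` is hypothetical, and it assumes the log uses the same `java.io.FileNotFoundException: <path> (No such file or directory)` wording shown above. It relies on Spark's `shuffle_<shuffleId>_<mapId>_<reduceId>.index` / `.data` file naming.

```python
import posixpath
import re

# Path reported by java.io.FileNotFoundException, as worded in the log above.
MISSING_FILE = re.compile(
    r"java\.io\.FileNotFoundException: (\S+) \(No such file or directory\)"
)
# Spark names shuffle files shuffle_<shuffleId>_<mapId>_<reduceId>.<index|data>.
SHUFFLE_NAME = re.compile(r"shuffle_(\d+)_(\d+)_(\d+)\.(index|data)$")


def missing_shuffle_files(log_text):
    """Return one dict per missing shuffle file mentioned in log_text.

    Hypothetical helper for illustration; it only parses text and never
    touches the file system.
    """
    results = []
    for path in MISSING_FILE.findall(log_text):
        name = SHUFFLE_NAME.search(path)
        if name is None:
            continue  # not a shuffle file, skip it
        shuffle_id, map_id, reduce_id, kind = name.groups()
        results.append(
            {
                "path": path,
                # .../blockmgr-<uuid>/<subdir>/<file> -> .../blockmgr-<uuid>
                "blockmgr_dir": posixpath.dirname(posixpath.dirname(path)),
                "shuffle_id": int(shuffle_id),
                "map_id": int(map_id),
                "reduce_id": int(reduce_id),
                "kind": kind,
            }
        )
    return results


if __name__ == "__main__":
    sample = (
        "Caused by: java.util.concurrent.ExecutionException: "
        "java.io.FileNotFoundException: /tmp/spark-local/"
        "blockmgr-8a26a714-3ecb-46dd-8499-ff796fa97744/0f/shuffle_1_0_0.index "
        "(No such file or directory)"
    )
    for entry in missing_shuffle_files(sample):
        print(entry["blockmgr_dir"], entry["shuffle_id"], entry["map_id"])
```

The `blockmgr_dir` value can then be compared with the `localDirs` entry in the "Registered executor" line to see which executor wrote the file.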
Any idea how to fix this?
I mounted the local dir /tmp/spark-local into my pods. When I ssh into each node, I can confirm that the block manager directory exists on one of the worker nodes (I'm guessing this is the expected behavior). The error occurs when one of the shuffle pods on another worker node tries to access that same block manager directory.
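The per-node check described above can be scripted rather than done by hand. This is a rough Python sketch under stated assumptions: the helper name `find_shuffle_file` is hypothetical, and it assumes the `<local dir>/blockmgr-<uuid>/<subdir>/<file>` layout visible in the log paths. It would be run on each worker node (or inside each shuffle pod) against the directory configured as spark.local.dir.

```python
import glob
import os


def find_shuffle_file(local_dir, file_name):
    """Return every path under local_dir/blockmgr-*/<subdir>/ named file_name.

    Hypothetical helper for illustration; an empty list means this node
    (or pod) cannot see the file.
    """
    pattern = os.path.join(local_dir, "blockmgr-*", "*", file_name)
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # Values taken from the question; adjust to the file named in your own log.
    matches = find_shuffle_file("/tmp/spark-local", "shuffle_1_0_0.index")
    if matches:
        for path in matches:
            print("found:", path)
    else:
        print("not present on this node")
```

If the file shows up on the node where the executor ran but not in the shuffle pod that logged the error, that is consistent with the observation above that the request was served from a different worker node than the one holding the block manager directory.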
Recommended answer
A summary from the comment thread.
To run Spark on Kubernetes with dynamic allocation enabled, you can:
Important notes:
- You should base your images on the kubespark images, which are built from the fork of Apache Spark 2.2.0
- The feature is experimental and unsupported