Spark worker throws FileNotFoundException on temporary shuffle files


Problem description

I am running a Spark application that processes multiple sets of data points; some of these sets need to be processed sequentially. When running the application for small sets of data points (ca. 100), everything works fine. But in some cases, the sets will have a size of ca. 10,000 data points, and those cause the worker to crash with the following stack trace:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 26.0 failed 4 times, most recent failure: Lost task 0.3 in stage 26.0 (TID 36, 10.40.98.10, executor 1): java.io.FileNotFoundException: /tmp/spark-5198d746-6501-4c4d-bb1c-82479d5fd48f/executor-a1d76cc1-a3eb-4147-b73b-29742cfd652d/blockmgr-d2c5371b-1860-4d8b-89ce-0b60a79fa394/3a/temp_shuffle_94d136c9-4dc4-439e-90bc-58b18742011c (No such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:102)
    at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:115)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:235)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

I have checked all log files after multiple instances of this error, but did not find any other error messages.

Searching the internet for this problem, I have found two potential causes that do not seem to be applicable to my situation:

• The user running the Spark process does not have read/write permission in the /tmp/ directory.
  • Seeing as the error occurs only for larger datasets (instead of always), I do not expect this to be the problem.
• There is not enough free space left in the /tmp/ directory.
  • The /tmp/ directory on my system has about 45GB available, and the amount of data in a single data point (< 1KB) means that this is also probably not the case (see the quick check sketched below).
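
For what it's worth, both of those checks can also be scripted in a few lines. A minimal sketch, assuming the default /tmp scratch location that appears in the stack trace (adjust the path if spark.local.dir points elsewhere):

```scala
import java.nio.file.{Files, Paths}

object TmpDirCheck {
  def main(args: Array[String]): Unit = {
    // /tmp is the scratch location shown in the stack trace; change this
    // if spark.local.dir is configured to a different directory.
    val tmp = Paths.get("/tmp")

    // Can the user running the Spark worker read and write there?
    println(s"readable: ${Files.isReadable(tmp)}, writable: ${Files.isWritable(tmp)}")

    // How much space is left for temporary shuffle files?
    val usableGb = Files.getFileStore(tmp).getUsableSpace.toDouble / (1024L * 1024 * 1024)
    println(f"usable space: $usableGb%.1f GB")
  }
}
```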

I have been flailing at this problem for a couple of hours, trying to find work-arounds and possible causes.

• I have tried reducing the cluster (normally two machines) to a single worker running on the same machine as the driver, in the hope that this would eliminate the need for shuffling and thereby prevent the error. This did not work; the error occurs in exactly the same way.
• I have isolated the problem to an operation that processes a set of data points sequentially by means of a tail-recursive method, roughly the shape sketched below.
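
The failing code itself is not shown in the question; purely as an illustration of the job shape, here is a minimal, hypothetical sketch (local-mode SparkSession, with invented names DataPoint and reduceSequentially standing in for the real code):

```scala
import org.apache.spark.sql.SparkSession

object SequentialSets {
  // Stand-in for the real per-point record type (invented for this sketch).
  final case class DataPoint(value: Double)

  // Recursive helper that folds one set of points in order (invented logic).
  def reduceSequentially(points: List[DataPoint], acc: Double = 0.0): Double =
    points match {
      case Nil          => acc
      case head :: tail => reduceSequentially(tail, acc + head.value)
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sequential-sets")
      .getOrCreate()

    // Set sizes taken from the question: ca. 100 points vs. ca. 10,000 points.
    val sets = spark.sparkContext.parallelize(Seq(
      List.fill(100)(DataPoint(1.0)),
      List.fill(10000)(DataPoint(1.0))
    ))

    // Each set is processed sequentially inside a single task on a worker.
    sets.map(set => reduceSequentially(set)).collect().foreach(println)

    spark.stop()
  }
}
```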

What is causing this problem? How can I go about determining the cause myself?

Answer

The problem turns out to be a stack overflow (ha!) occurring on the worker.

On a hunch, I rewrote the operation to be performed entirely on the driver (effectively disabling Spark functionality). When I ran this code, the system still crashed, but now displayed a StackOverflowError. Contrary to what I previously believed, a tail-recursive method can apparently cause a stack overflow just like any other form of recursion. After rewriting the method to no longer use recursion, the problem disappeared.
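
As a hypothetical illustration (not the actual code from the question) of how this can happen: in the first method below the addition runs after the recursive call returns, so it is not a genuine tail call and the Scala compiler cannot rewrite it into a loop; annotating it with @scala.annotation.tailrec would be rejected at compile time. The second method is the kind of non-recursive rewrite that avoids the problem.

```scala
object RecursionDepthDemo {
  // Looks harmlessly recursive, but `head + ...` runs *after* the recursive
  // call returns, so this is not a tail call: every element costs one stack
  // frame, and @scala.annotation.tailrec on this method would not compile.
  def sumRecursive(points: List[Double]): Double = points match {
    case Nil          => 0.0
    case head :: tail => head + sumRecursive(tail)
  }

  // Iterative rewrite: constant stack depth regardless of the size of the set.
  def sumIterative(points: List[Double]): Double = {
    var acc  = 0.0
    var rest = points
    while (rest.nonEmpty) {
      acc += rest.head
      rest = rest.tail
    }
    acc
  }

  def main(args: Array[String]): Unit = {
    val big = List.fill(10000)(1.0)
    println(sumIterative(big)) // fine for any set size
    println(sumRecursive(big)) // may throw StackOverflowError for sets this large
  }
}
```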

A stack overflow is probably not the only problem that can produce the original FileNotFoundException, but making a temporary code change which pulls the operation to the driver seems to be a good way to determine the actual cause of the problem.
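
In terms of the hypothetical sketch above (reusing its invented sets RDD and reduceSequentially helper), that temporary diagnostic change amounts to collecting the data and running the same helper on the driver, so the underlying error surfaces in the driver log instead of showing up as a lost shuffle file on a worker:

```scala
// Temporary diagnostic only, not a fix: bypass the distributed map and run
// the helper locally on the driver. Any StackOverflowError (or other failure)
// now appears directly in the driver output.
val localSets = sets.collect()
localSets.map(set => reduceSequentially(set)).foreach(println)
```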
