PySpark throws java.io.EOFException when reading big files with boto3

Problem description

I'm using boto3 to read files from S3, which has proven to be much faster than sc.textFile(...). These files are between 300MB and 1GB approx. The process goes like:

# Distribute the S3 keys across n_partitions partitions and read/split each
# file with boto3 inside the flatMap
data = sc.parallelize(list_of_files, numSlices=n_partitions) \
    .flatMap(read_from_s3_and_split_lines)

events = data.aggregateByKey(...)
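
read_from_s3_and_split_lines is not shown above; here is a minimal sketch of what such a helper might look like, assuming each element of list_of_files is an S3 object key (the bucket name my-bucket below is a placeholder, not from the original question):

import boto3

def read_from_s3_and_split_lines(key):
    # Hypothetical helper: download one object with boto3 and return its lines.
    # 'my-bucket' is a placeholder; the real bucket/key layout is not shown here.
    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket='my-bucket', Key=key)
    body = obj['Body'].read()  # the entire 300MB-1GB file is held in the Python worker's memory
    return body.decode('utf-8').splitlines()

If the real helper also reads the whole body at once, each file lives entirely in the Python worker's memory, which is relevant to the memory discussion in the answer below.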

When running this process, I get the following exception:

15/12/04 10:58:00 WARN TaskSetManager: Lost task 41.3 in stage 0.0 (TID 68, 10.83.25.233): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:203)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:139)
    ... 15 more

Many times, just some tasks crash and the job is able to recover. However, sometimes the whole job crashes after a number of these errors. I haven't been able to find the origin of this problem, and it seems to appear and disappear depending on the number of files I read, the exact transformations I apply, etc. It never fails when reading a single file.

Recommended answer

I have encountered a similar problem; my investigation showed that the cause was a lack of free memory for the Python process. Spark had taken all the memory, and the Python process (where the PySpark code runs) kept crashing.

Some suggestions:

  1. Add more memory to the machines,
  2. Unpersist RDDs you no longer need,
  3. Manage memory more wisely (put some limits on Spark's memory usage); see the configuration sketch below.
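
As a rough illustration of suggestions 2 and 3, here is a configuration sketch assuming Spark on YARN in the 1.5/1.6 era implied by the stack trace. The property names are standard Spark settings, but the values are placeholders to tune for your cluster (on newer Spark the overhead setting is called spark.executor.memoryOverhead):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.memory", "4g")                  # JVM heap per executor
        .set("spark.yarn.executor.memoryOverhead", "2048")   # extra container memory (MB) left for Python workers on YARN
        .set("spark.python.worker.memory", "1g"))            # per-Python-worker memory before spilling to disk
sc = SparkContext(conf=conf)

# Suggestion 2: release cached RDDs once they are no longer needed
# some_rdd.unpersist()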
