PySpark throws java.io.EOFException when reading big files with boto3

Problem description

I'm using boto3 to read files from S3, which has proven to be much faster than sc.textFile(...). These files are between 300MB and 1GB approx. The process goes like:

# Distribute the S3 keys across n_partitions partitions and read/split each
# file with boto3 inside the flatMap
data = sc.parallelize(list_of_files, numSlices=n_partitions) \
    .flatMap(read_from_s3_and_split_lines)

events = data.aggregateByKey(...)
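
read_from_s3_and_split_lines is not shown above; here is a minimal sketch of what such a helper might look like, assuming each element of list_of_files is an S3 object key (the bucket name my-bucket below is a placeholder, not from the original question):

import boto3

def read_from_s3_and_split_lines(key):
    # Hypothetical helper: download one object with boto3 and return its lines.
    # 'my-bucket' is a placeholder; the real bucket/key layout is not shown here.
    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket='my-bucket', Key=key)
    body = obj['Body'].read()  # the entire 300MB-1GB file is held in the Python worker's memory
    return body.decode('utf-8').splitlines()

If the real helper also reads the whole body at once, each file lives entirely in the Python worker's memory, which is relevant to the memory discussion in the answer below.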

When running this process, I get the following exception:

15/12/04 10:58:00 WARN TaskSetManager: Lost task 41.3 in stage 0.0 (TID 68, 10.83.25.233): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:203)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:139)
    ... 15 more

Many times, just some tasks crash and the job is able to recover. However, sometimes the whole job crashes after a number of these errors. I haven't been able to find the origin of this problem, and it seems to appear and disappear depending on the number of files I read, the exact transformations I apply, etc. It never fails when reading a single file.

Recommended answer

I have encountered a similar problem; my investigation showed that the cause was a lack of free memory for the Python process. Spark had taken all the memory, and the Python process (where the PySpark code runs) kept crashing.

Some suggestions:

  1. Add more memory to the machines,
  2. Unpersist RDDs you no longer need,
  3. Manage memory more wisely (put some limits on Spark's memory usage); see the configuration sketch below.
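
As a rough illustration of suggestions 2 and 3, here is a configuration sketch assuming Spark on YARN in the 1.5/1.6 era implied by the stack trace. The property names are standard Spark settings, but the values are placeholders to tune for your cluster (on newer Spark the overhead setting is called spark.executor.memoryOverhead):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.memory", "4g")                  # JVM heap per executor
        .set("spark.yarn.executor.memoryOverhead", "2048")   # extra container memory (MB) left for Python workers on YARN
        .set("spark.python.worker.memory", "1g"))            # per-Python-worker memory before spilling to disk
sc = SparkContext(conf=conf)

# Suggestion 2: release cached RDDs once they are no longer needed
# some_rdd.unpersist()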
