Spark org.apache.http.ConnectionClosedException when calling .show() and .toPandas() with an S3 dataframe
Problem description
I created a PySpark DataFrame df with Parquet data on AWS S3. Calling df.count() works, but df.show() or df.toPandas() fails with the following error:
Py4JJavaError: An error occurred while calling o41.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0
failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 14, 10.20.202.97,
executor driver): org.apache.http.ConnectionClosedException: Premature end of Content-
Length delimited message body (expected: 77,826,675; received: 8,192)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:198)
at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:101)
at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:166)
at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:172)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at com.amazonaws.services.s3.model.S3ObjectInputStream.abort(S3ObjectInputStream.java:90)
at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:72)
at org.apache.hadoop.fs.s3a.S3AInputStream.seek(S3AInputStream.java:115)
at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
at org.apache.parquet.hadoop.util.H1SeekableInputStream.seek(H1SeekableInputStream.java:46)
at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1157)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
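The expected/received pair in the first line of the trace tells the story: the HTTP client was promised a 77,826,675-byte Content-Length-delimited body but the connection closed after 8,192 bytes. This also explains why df.count() succeeds while df.show() fails: a Parquet count can often be answered from footer metadata, whereas show() and toPandas() stream actual column chunks from S3, which is where the connection dies. A minimal stdlib sketch of the kind of check that produces this message (hypothetical ContentLengthReader, not the Apache HttpClient implementation):

```python
import io

class ContentLengthReader:
    """Read exactly `expected` bytes from a stream; raise if the stream
    ends early, mirroring what ContentLengthInputStream reports."""

    def __init__(self, stream, expected):
        self.stream = stream
        self.expected = expected

    def read_all(self):
        data = self.stream.read(self.expected)
        if len(data) < self.expected:
            # Same shape as the Spark/httpclient error above
            raise ConnectionError(
                "Premature end of Content-Length delimited message body "
                f"(expected: {self.expected:,}; received: {len(data):,})"
            )
        return data

# A body that claims 16 bytes but delivers only 8 reproduces the error shape:
short_body = io.BytesIO(b"8 bytes.")
try:
    ContentLengthReader(short_body, 16).read_all()
except ConnectionError as e:
    print(e)
```

The point is that the exception is raised by the reader, not the writer: S3 (or something between the JVM and S3, such as the TLS layer) closed the connection mid-body, and httpclient only notices when the byte count comes up short.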
I have answered my own question below: a Java update seems to be responsible for the problem. But can anyone solve this without downgrading Java?
Recommended answer
This happens after the latest Java update (1.8.0_272). Reverting to 1.8.0_265 will solve the problem.
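If downgrading Java is not an option: 8u272 enabled TLS 1.3 by default in JDK 8, and a commonly reported workaround for this class of S3A failure (an assumption here, not part of the answer above) is to pin the JVM back to TLS 1.2 via a system property, e.g. in spark-defaults.conf:

```properties
# Assumption: the 8u272 regression is TLS 1.3-related; pinning TLS 1.2
# is a commonly reported alternative to reverting the JDK.
spark.driver.extraJavaOptions    -Djdk.tls.client.protocols=TLSv1.2
spark.executor.extraJavaOptions  -Djdk.tls.client.protocols=TLSv1.2
```

The same properties can be passed per-job with --conf on spark-submit instead of editing spark-defaults.conf.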