java.lang.OutOfMemoryError: Unable to acquire 100 bytes of memory, got 0
Problem description
I'm invoking PySpark with Spark 2.0 in local mode with the following command:
pyspark --executor-memory 4g --driver-memory 4g
The input dataframe is read from a TSV file and has 580K rows x 28 columns. I do a few operations on the dataframe and then try to export it to a TSV file, at which point I get this error.
df.coalesce(1).write.save("sample.tsv",format = "csv",header = 'true', delimiter = '\t')
Any pointers on how to get rid of this error? I can easily display the df or count the rows.
The output dataframe has 3100 rows and 23 columns.
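For reference, the full read-compute-write path amounts to roughly the sketch below. The input file name, the read options, and the SparkSession setup are assumptions added for completeness; only the final write line comes from the question itself.

from pyspark.sql import SparkSession

# In the pyspark shell a SparkSession named `spark` already exists;
# this line only matters if the snippet is run as a standalone script.
spark = SparkSession.builder.getOrCreate()

# Hypothetical read of the input TSV: the real path and options were not given.
df = spark.read.csv("input.tsv", sep="\t", header=True, inferSchema=True)

# ... a few operations on df (the stack trace below shows a window/sort step) ...

# The write that triggers the OutOfMemoryError, as posted:
df.coalesce(1).write.save("sample.tsv", format="csv", header='true', delimiter='\t')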
Error:
Job aborted due to stage failure: Task 0 in stage 70.0 failed 1 times, most recent failure: Lost task 0.0 in stage 70.0 (TID 1073, localhost): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Unable to acquire 100 bytes of memory, got 0
at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:129)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.execution.WindowExec$$anonfun$15$$anon$1.fetchNextRow(WindowExec.scala:300)
at org.apache.spark.sql.execution.WindowExec$$anonfun$15$$anon$1.<init>(WindowExec.scala:309)
at org.apache.spark.sql.execution.WindowExec$$anonfun$15.apply(WindowExec.scala:289)
at org.apache.spark.sql.execution.WindowExec$$anonfun$15.apply(WindowExec.scala:288)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:96)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:95)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
... 8 more
Driver stacktrace:
Answer

The problem for me was indeed coalesce(). What I did was export the file without coalesce(), writing it to Parquet instead with df.write.parquet("testP"), then read the file back and export that with coalesce(1).
Hopefully it works for you as well.
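Putting that together, a minimal sketch of the workaround might look like the following. The path names mirror the ones in the question and answer; the comments on why it helps are a reading of the stack trace (coalesce(1) folds the whole upstream computation, including the window sort, into a single task), not something stated in the original post.

# 1. Materialize the computed dataframe without coalescing, so the
#    expensive window/sort work stays spread across many partitions.
df.write.parquet("testP")

# 2. Read the already-computed result back; this is now a cheap scan.
df_out = spark.read.parquet("testP")

# 3. Coalescing to one partition now only merges finished rows instead of
#    re-running the whole pipeline inside a single task's memory budget.
df_out.coalesce(1).write.save("sample.tsv", format="csv", header='true', delimiter='\t')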