java.lang.OutOfMemoryError: Unable to acquire 100 bytes of memory, got 0


Problem Description


I'm invoking Pyspark with Spark 2.0 in local mode with the following command:

pyspark --executor-memory 4g --driver-memory 4g

The input dataframe is read from a TSV file and has about 580K rows and 28 columns. I'm doing a few operations on the dataframe, and when I then try to export it to a TSV file I get this error.

df.coalesce(1).write.save("sample.tsv", format="csv", header='true', delimiter='\t')

Any pointers on how to get rid of this error? I can easily display the df or count its rows.

The output dataframe has 3,100 rows and 23 columns.

Error:

Job aborted due to stage failure: Task 0 in stage 70.0 failed 1 times, most recent failure: Lost task 0.0 in stage 70.0 (TID 1073, localhost): org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Unable to acquire 100 bytes of memory, got 0
    at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:129)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
    at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at org.apache.spark.sql.execution.WindowExec$$anonfun$15$$anon$1.fetchNextRow(WindowExec.scala:300)
    at org.apache.spark.sql.execution.WindowExec$$anonfun$15$$anon$1.<init>(WindowExec.scala:309)
    at org.apache.spark.sql.execution.WindowExec$$anonfun$15.apply(WindowExec.scala:289)
    at org.apache.spark.sql.execution.WindowExec$$anonfun$15.apply(WindowExec.scala:288)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:96)
    at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:95)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
    ... 8 more

Driver stacktrace:

Solution

The problem for me was indeed coalesce(). Instead of exporting the file with coalesce(), I wrote it out as Parquet using df.write.parquet("testP"), then read that file back and exported it with coalesce(1).
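
For reference, a minimal PySpark sketch of that workaround is below. It assumes df is the transformed dataframe from the question and that spark is the session already available in the pyspark shell; the intermediate path "testP" follows the answer, and the final write call mirrors the one from the question.

# 1. Export without coalesce(): Parquet keeps the existing partitioning,
#    so no single task has to materialize the whole upstream result.
df.write.parquet("testP")

# 2. Read the materialized data back; this new plan no longer drags along
#    the expensive upstream stages (window functions, sorts, etc.).
df_out = spark.read.parquet("testP")

# 3. Coalescing the small, already-computed result to one partition is cheap,
#    and it can now be written out as a single tab-separated file.
df_out.coalesce(1).write.save("sample.tsv", format="csv", header='true', delimiter='\t')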

Hopefully it works for you as well.
