pyspark EOFError after calling map


Problem description

I am new to Spark and PySpark.

I am reading a small CSV file (~40k) into a DataFrame:

from pyspark.sql import functions as F
from pyspark.sql import Row                    # used in the map below
from pyspark.mllib.linalg import Vectors       # in Spark 1.6, Vectors lives in pyspark.mllib.linalg

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('/tmp/sm.csv')
df = df.withColumn('verified', F.when(df['verified'] == 'Y', 1).otherwise(0))
df2 = df.map(lambda x: Row(label=float(x[0]), features=Vectors.dense(x[1:]))).toDF()

I get a weird error that does not occur every single time, but does happen pretty regularly:

>>> df2.show(1)
+--------------------+---------+
|            features|    label|
+--------------------+---------+
|[0.0,0.0,0.0,0.0,...|4700734.0|
+--------------------+---------+
only showing top 1 row

>>> df2.count()
41999                                                                           
>>> df2.show(1)
+--------------------+---------+
|            features|    label|
+--------------------+---------+
|[0.0,0.0,0.0,0.0,...|4700734.0|
+--------------------+---------+
only showing top 1 row

>>> df2.count()
41999                                                                           
>>> df2.show(1)
Traceback (most recent call last):
  File "spark-1.6.1/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
  File "spark-1.6.1/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker    
  File "spark-1.6.1/python/lib/pyspark.zip/pyspark/worker.py", line 136, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "spark-1.6.1/python/lib/pyspark.zip/pyspark/serializers.py", line 545, in read_int
    raise EOFError
EOFError
+--------------------+---------+
|            features|    label|
+--------------------+---------+
|[0.0,0.0,0.0,0.0,...|4700734.0|
+--------------------+---------+
only showing top 1 row

Once that EOFError has been raised, I do not see it again until I do something that requires interacting with the Spark server.

When I call df2.count(), it shows the [Stage xxx] progress prompt, which is what I mean by interacting with the Spark server. Anything that triggers that seems to eventually end up giving the EOFError again when I do something with df2.

It does not seem to happen with df (as opposed to df2), so it must be something happening on the df.map() line.

Recommended answer

Can you try doing the map after converting the DataFrame into an RDD? You are applying a map function to a DataFrame and then creating a DataFrame from it again. The syntax would be like:

df.rdd.map().toDF()

Please let me know if it works. Thanks.
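
Applied to the code from the question, the rewritten line might look like the sketch below. This is a minimal sketch, assuming the same sm.csv layout as in the question (label in the first column, numeric features in the remaining columns):

from pyspark.sql import functions as F
from pyspark.sql import Row
from pyspark.mllib.linalg import Vectors  # Spark 1.6: Vectors is in pyspark.mllib.linalg

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('/tmp/sm.csv')
df = df.withColumn('verified', F.when(df['verified'] == 'Y', 1).otherwise(0))

# Map over the underlying RDD explicitly instead of the DataFrame itself,
# then rebuild the DataFrame from the resulting Rows
df2 = df.rdd.map(lambda x: Row(label=float(x[0]), features=Vectors.dense(x[1:]))).toDF()
df2.show(1)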

