error in labelled point object pyspark
Question
I am writing a function which

- takes an RDD as input
- splits the comma-separated values
- then converts each row into a LabeledPoint object
- finally fetches the output as a DataFrame
code:
def parse_points(raw_rdd):
    cleaned_rdd = raw_rdd.map(lambda line: line.split(","))
    new_df = cleaned_rdd.map(lambda line: LabeledPoint(line[0], [line[1:]])).toDF()
    return new_df
output = parse_points(input_rdd)
Up to this point, if I run the code there is no error and it works fine.
But when I add the line
output.take(5)
I get the error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 129.0 failed 1 times, most recent failure: Lost task 0.0 in stage 129.0 (TID 152, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
Py4JJavaError Traceback (most recent call last)
<ipython-input-100-a68c448b64b0> in <module>()
20
21 output = parse_points(raw_rdd)
---> 22 print output.show()
Please suggest what the mistake is.
Answer
The reason you had no errors until you executed the action:
output.take(5)
is due to the lazy nature of Spark: nothing is executed in Spark until you run the action take(5).
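As a minimal illustration of that laziness (assuming a running SparkContext named sc; the sample data is made up for the example):

rdd = sc.parallelize(["1,2,3", "4,5,6"])
# Transformations only build the lineage; nothing runs yet, so this line never fails.
broken = rdd.map(lambda line: int(line))   # int() cannot parse "1,2,3"

# The error surfaces only when an action forces execution, e.g.:
# broken.take(5)   # raises ValueError inside the executor at this point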
You have a few issues in your code, and I think you are failing due to the extra "[" and "]" in [line[1:]].
So you need to remove the extra "[" and "]" in [line[1:]] (and keep only line[1:]).
Another issue you might need to solve is the lack of a DataFrame schema.
i.e. replace "toDF()" with "toDF(["features","label"])". This will give the DataFrame a schema.
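Putting both fixes together, a minimal corrected sketch of parse_points might look like this (assuming LabeledPoint comes from pyspark.mllib.regression, that the first comma-separated field is the label, and that the remaining fields are numeric strings):

from pyspark.mllib.regression import LabeledPoint

def parse_points(raw_rdd):
    # Split each comma-separated line into a list of string fields.
    cleaned_rdd = raw_rdd.map(lambda line: line.split(","))
    # Use line[1:] without the extra brackets, and pass an explicit schema to toDF().
    new_df = cleaned_rdd.map(lambda line: LabeledPoint(line[0], line[1:])).toDF(["features", "label"])
    return new_df

output = parse_points(input_rdd)
output.take(5)   # the action should now run without the original error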