'RDD' object has no attribute '_jdf' pyspark RDD


Problem Description

I'm new to pyspark. I would like to perform some machine learning on a text file.

from pyspark import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
sc = SparkContext
spark = SparkSession.builder.appName("ML").getOrCreate()

train_data = spark.read.text("20ng-train-all-terms.txt")
td = train_data.rdd  # convert the DataFrame to an RDD
tr_data= td.map(lambda line: line.split()).map(lambda words: Row(label=words[0],words=words[1:]))
from pyspark.ml.feature import CountVectorizer

vectorizer = CountVectorizer(inputCol ="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)

For my last command, I obtain the error "AttributeError: 'RDD' object has no attribute '_jdf'".


Can anyone help me? Thanks.

Recommended Answer

You shouldn't be using an rdd with CountVectorizer. Instead, you should form the array of words in the dataframe itself, as

train_data = spark.read.text("20ng-train-all-terms.txt")

from pyspark.sql import functions as F
td = train_data.select(F.split("value", " ").alias("words")).select(F.col("words")[0].alias("label"), F.col("words"))

from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)

And then it should work, so that you can call the transform function as

vectorizer_transformer.transform(td).show(truncate=False)

Now, if you want to stick to the old style of converting to an rdd, then you have to modify certain lines of code. The following is your modified, complete (working) code:

from pyspark import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
sc = SparkContext
spark = SparkSession.builder.appName("ML").getOrCreate()

train_data = spark.read.text("20ng-train-all-terms.txt")
td = train_data.rdd  # convert the DataFrame to an RDD
tr_data= td.map(lambda line: line[0].split(" ")).map(lambda words: Row(label=words[0], words=words[1:])).toDF()
from pyspark.ml.feature import CountVectorizer

vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(tr_data)

But I would suggest you stick with the dataframe way.
