'RDD' object has no attribute '_jdf' in PySpark


Problem description

I'm new to PySpark. I would like to perform some machine learning on a text file.

from pyspark import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
sc = SparkContext
spark = SparkSession.builder.appName("ML").getOrCreate()

train_data = spark.read.text("20ng-train-all-terms.txt")
td = train_data.rdd  # convert the DataFrame to an RDD
tr_data= td.map(lambda line: line.split()).map(lambda words: Row(label=words[0],words=words[1:]))
from pyspark.ml.feature import CountVectorizer

vectorizer = CountVectorizer(inputCol ="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)

For my last command, I get the error: AttributeError: 'RDD' object has no attribute '_jdf'


Can anyone help me, please? Thank you.

Recommended answer

You shouldn't pass an RDD to CountVectorizer; its fit method expects a DataFrame. Instead, build the array of words in the DataFrame itself:

train_data = spark.read.text("20ng-train-all-terms.txt")

from pyspark.sql import functions as F
td= train_data.select(F.split("value", " ").alias("words")).select(F.col("words")[0].alias("label"), F.col("words"))

from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)

The fit should then work, and you can call the transform function on the fitted model:

vectorizer_transformer.transform(td).show(truncate=False)
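To get a feel for what the bag_of_words column holds, here is a rough pure-Python picture of the idea behind CountVectorizer. It is only a sketch under stated assumptions, not Spark's implementation: the helper names build_vocabulary and to_counts are made up for illustration, the sample documents are invented, parameters such as vocabSize and minDF are ignored, and Spark stores the result as a sparse vector rather than a dict.

```python
from collections import Counter


def build_vocabulary(docs):
    """Order terms by total count across all documents, most frequent first."""
    totals = Counter(word for doc in docs for word in doc)
    return [term for term, _ in totals.most_common()]


def to_counts(doc, vocabulary):
    """Map one document to {vocabulary index: count}, skipping unknown terms."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(doc)
    return {index[term]: n for term, n in counts.items() if term in index}


# Invented sample documents, each already split into words.
docs = [["a", "b", "a"], ["b", "c", "b"]]
vocabulary = build_vocabulary(docs)
vectors = [to_counts(doc, vocabulary) for doc in docs]

print(vocabulary)
print(vectors)
```

Each output vector has one slot per vocabulary term, and only the non-zero counts are kept, which is why the column prints as index/value pairs in show().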

If you would rather keep the older style of converting to an RDD, a few lines of your code have to change. Here is your complete code, modified so that it works:

from pyspark import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
sc = SparkContext
spark = SparkSession.builder.appName("ML").getOrCreate()

train_data = spark.read.text("20ng-train-all-terms.txt")
td = train_data.rdd  # convert the DataFrame to an RDD
tr_data= td.map(lambda line: line[0].split(" ")).map(lambda words: Row(label=words[0], words=words[1:])).toDF()
from pyspark.ml.feature import CountVectorizer

vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(tr_data)

That said, I would suggest sticking with the DataFrame approach.
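As for the error message itself: in classic PySpark, ML estimators hand the DataFrame's underlying JVM object, stored in the private attribute _jdf, over to Spark, and an RDD has no such attribute. The following pure-Python sketch mimics that behaviour to show where the AttributeError comes from. FakeDataFrame, FakeRDD and FakeEstimator are made-up stand-ins, not PySpark classes.

```python
class FakeDataFrame:
    """Stand-in for a DataFrame: carries a handle to a JVM-side object."""

    def __init__(self):
        self._jdf = object()


class FakeRDD:
    """Stand-in for an RDD: it has no _jdf attribute."""

    def toDF(self):
        return FakeDataFrame()


class FakeEstimator:
    """Stand-in for an estimator such as CountVectorizer."""

    def fit(self, dataset):
        # Reading dataset._jdf is the step that fails for a non-DataFrame.
        handle = dataset._jdf
        return "fitted" if handle is not None else None


estimator = FakeEstimator()
rdd = FakeRDD()

message = ""
try:
    estimator.fit(rdd)
except AttributeError as err:
    message = str(err)
    print(message)

# Converting to a DataFrame first lets fit proceed.
result = estimator.fit(rdd.toDF())
print(result)
```

In other words, whenever this error appears the cure is the same: make sure the object passed to fit or transform is a DataFrame, either by staying in the DataFrame API or by calling toDF() on the RDD.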
