'RDD' object has no attribute '_jdf' pyspark RDD
Question
I'm new to pyspark. I would like to perform some machine learning on a text file.
from pyspark import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
sc = SparkContext
spark = SparkSession.builder.appName("ML").getOrCreate()
train_data = spark.read.text("20ng-train-all-terms.txt")
td= train_data.rdd #transformer df to rdd
tr_data= td.map(lambda line: line.split()).map(lambda words: Row(label=words[0],words=words[1:]))
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol ="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)
For my last command, I get the error "AttributeError: 'RDD' object has no attribute '_jdf'".
Can anyone help me? Thank you.
Recommended answer
You shouldn't use an rdd with CountVectorizer: its fit method expects a dataframe, and the '_jdf' attribute named in the error is the dataframe's internal handle, which an rdd does not have. Instead, form the array of words in the dataframe itself:
train_data = spark.read.text("20ng-train-all-terms.txt")
from pyspark.sql import functions as F
td= train_data.select(F.split("value", " ").alias("words")).select(F.col("words")[0].alias("label"), F.col("words"))
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)
It should then work, and you can call the transform function as:
vectorizer_transformer.transform(td).show(truncate=False)
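To make the bag_of_words column easier to read, here is a plain-Python sketch of what a count vectorizer computes. It is only an illustration on made-up tokens and does not use Spark; the helper names (build_vocabulary, count_vector) are hypothetical, and the vocabulary ordering used here (most frequent first, ties broken alphabetically) is an assumption for the example, not a statement about Spark's internals.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token to an index, most frequent tokens first."""
    totals = Counter(token for doc in documents for token in doc)
    # Sort by descending count, then alphabetically, so the result is deterministic.
    ordered = sorted(totals, key=lambda t: (-totals[t], t))
    return {token: index for index, token in enumerate(ordered)}


def count_vector(document, vocabulary):
    """Return a dense list of per-token counts for one document."""
    counts = [0] * len(vocabulary)
    for token in document:
        if token in vocabulary:
            counts[vocabulary[token]] += 1
    return counts


# Hypothetical tokenised lines, standing in for the "words" column.
docs = [["spark", "rdd", "spark"], ["dataframe", "spark"]]
vocab = build_vocabulary(docs)
vectors = [count_vector(doc, vocab) for doc in docs]
```

Each position in a vector is the number of times the vocabulary token with that index appears in the row. Spark typically shows the same information in sparse form rather than as a dense list.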
Now, if you want to stick with the old style of converting to an rdd, you have to modify a few lines: split the text field of each row (line[0]) rather than the row itself, convert the result back to a dataframe with toDF(), and pass that dataframe (tr_data, not td) to fit. Here is your complete code, modified and working:
from pyspark import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
sc = SparkContext
spark = SparkSession.builder.appName("ML").getOrCreate()
train_data = spark.read.text("20ng-train-all-terms.txt")
td = train_data.rdd  # convert the dataframe to an rdd of Row objects
tr_data= td.map(lambda line: line[0].split(" ")).map(lambda words: Row(label=words[0], words=words[1:])).toDF()
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(tr_data)
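The change from line.split() to line[0].split(" ") matters because each element of the rdd is a row holding one text field, not a bare string. The sketch below imitates that without Spark: TextRow is a hypothetical namedtuple standing in for a pyspark Row with a single "value" field, and the sample line and its label are made up.

```python
from collections import namedtuple

# Stand-in for a pyspark Row with a single "value" field (assumption for illustration).
TextRow = namedtuple("TextRow", ["value"])


def parse_row(row):
    """Split the text in the first field into a label and the remaining words."""
    tokens = row[0].split(" ")
    return {"label": tokens[0], "words": tokens[1:]}


# Made-up line in the assumed "label word word ..." format.
row = TextRow(value="sports the match ended in a draw")
parsed = parse_row(row)

# Calling split() on the row itself fails, because the row is a tuple, not a string.
try:
    row.split(" ")
    split_on_row_failed = False
except AttributeError:
    split_on_row_failed = True
```

Here parsed holds the label and the word list separately, which is the shape that Row(label=..., words=...) builds before toDF() turns the rdd back into a dataframe.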
But I would suggest you stick with the dataframe approach.