'RDD' object has no attribute '_jdf' pyspark RDD
Question
I'm new to pyspark. I would like to perform some machine learning on a text file.
from pyspark import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
sc = SparkContext
spark = SparkSession.builder.appName("ML").getOrCreate()
train_data = spark.read.text("20ng-train-all-terms.txt")
td= train_data.rdd #transformer df to rdd
tr_data= td.map(lambda line: line.split()).map(lambda words: Row(label=words[0],words=words[1:]))
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol ="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)
And for my last command, I obtain the error "AttributeError: 'RDD' object has no attribute '_jdf'".
Can anyone help me please? Thank you.
Answer
You shouldn't be using an rdd with CountVectorizer: the estimators in pyspark.ml operate on DataFrames, and _jdf is the handle to the underlying Java DataFrame, which a plain RDD doesn't have. Instead you should form the array of words in the dataframe itself, as
train_data = spark.read.text("20ng-train-all-terms.txt")
from pyspark.sql import functions as F
td= train_data.select(F.split("value", " ").alias("words")).select(F.col("words")[0].alias("label"), F.col("words"))
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)
And then it should work, so that you can call the transform function as
vectorizer_transformer.transform(td).show(truncate=False)
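Since the 20ng-train-all-terms.txt file isn't shown here, the per-line effect of the F.split-plus-indexing step can be sketched in plain Python (the sample line below is an assumption about the 20-newsgroups format, where each line starts with a label followed by the document's words; note that in the dataframe version the "words" column keeps the full token list, label included):

```python
# Plain-Python sketch of what the dataframe transformation does to one line:
# split the raw text on spaces, take element 0 as the label, and keep the
# full token list as the "words" column.
def split_line(value):
    words = value.split(" ")
    return {"label": words[0], "words": words}

row = split_line("alt.atheism the quick brown fox")
print(row["label"])       # alt.atheism
print(row["words"][1:])   # ['the', 'quick', 'brown', 'fox']
```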
Now, if you want to stick to the old style of converting to an rdd, then you have to modify certain lines of code. Following is your modified complete (working) code:
from pyspark.sql import Row  # Row lives in pyspark.sql, not the top-level pyspark package
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.appName("ML").getOrCreate()  # this also creates the SparkContext
train_data = spark.read.text("20ng-train-all-terms.txt")
td = train_data.rdd  # transform df to rdd
tr_data = td.map(lambda line: line[0].split(" ")).map(lambda words: Row(label=words[0], words=words[1:])).toDF()
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(tr_data)
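The key change from the question's code is line[0].split(" ") instead of line.split(): each element of train_data.rdd is a Row with a single string field, not a string itself, so .split must be called on the row's first field. A Row supports positional indexing like a tuple, which can be sketched with a namedtuple stand-in (pure Python, no Spark required; the class and field names are just for illustration):

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row: a row from spark.read.text has one string
# field named "value" and supports positional indexing like a tuple.
FakeRow = namedtuple("FakeRow", ["value"])

line = FakeRow(value="sci.space rockets are fun")
# line.split() would raise AttributeError: the row itself is not a string.
assert not hasattr(line, "split")
# Index into the row first, then split the underlying string:
words = line[0].split(" ")
print(words[0])    # sci.space
print(words[1:])   # ['rockets', 'are', 'fun']
```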
But I would suggest you to stick with the dataframe way.