Running KMeans clustering in PySpark
Question
It's my very first time trying to run a KMeans cluster analysis in Spark, so I'm sorry for a stupid question.
I have a Spark dataframe mydataframe with many columns. I want to run kmeans on only two columns: lat and long (latitude & longitude), using them as simple values. I want to extract 7 clusters based on just those 2 columns. I've tried:
from numpy import array
from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel
# Prepare a data frame with just 2 columns:
data = mydataframe.select('lat', 'long')
# Build the model (cluster the data)
clusters = KMeans.train(data, 7, maxIterations=15, initializationMode="random")
But I get an error:
'DataFrame' object has no attribute 'map'
What should be the object one feeds to KMeans.train? Clearly, it doesn't accept a DataFrame. How should I prepare my dataframe for the analysis?
Thank you very much!
Answer
The method KMeans.train takes an RDD as input, not a DataFrame (data). So you just have to convert data to an RDD: data.rdd. Hope it helps.