How to convert type Row into Vector to feed to KMeans
Question
When I try to feed df2 to KMeans, I get the following error:
clusters = KMeans.train(df2, 10, maxIterations=30,
                        runs=10, initializationMode="random")
The error I get:
Cannot convert type <class 'pyspark.sql.types.Row'> into Vector
df2 is a DataFrame created as follows:
df = sqlContext.read.json("data/ALS3.json")
df2 = df.select('latitude','longitude')
df2.show()
+----------+----------+
|  latitude| longitude|
+----------+----------+
|60.1643075|24.9460844|
|60.4686748|22.2774728|
+----------+----------+
How can I convert these two columns to a Vector and feed it to KMeans?
Answer
ML
The problem is that you missed the documentation's example; it's pretty clear that the train method requires a DataFrame with a Vector as features.
To modify your current data's structure you can use a VectorAssembler. In your case it could be something like:
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col

vectorAssembler = VectorAssembler(inputCols=["latitude", "longitude"],
                                  outputCol="features")

# For your special case that has strings instead of doubles, cast them first.
expr = [col(c).cast("Double").alias(c)
        for c in vectorAssembler.getInputCols()]
df2 = df2.select(*expr)
df = vectorAssembler.transform(df2)
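Conceptually, the cast-then-assemble step above just packs the values of the input columns of each row into one "features" vector column. A plain-Python sketch of that idea (illustrative only; the row dicts and column names here are made up for the example, and in Spark this work is done by VectorAssembler itself):

```python
# Plain-Python sketch of what the cast + VectorAssembler step does per row:
# cast each string column to a float, then collect the input columns
# into a single "features" list (Spark would build a Vector instead).
rows = [{"latitude": "60.1643075", "longitude": "24.9460844"},
        {"latitude": "60.4686748", "longitude": "22.2774728"}]

input_cols = ["latitude", "longitude"]

assembled = [{**row, "features": [float(row[c]) for c in input_cols]}
             for row in rows]

print(assembled[0]["features"])  # → [60.1643075, 24.9460844]
```

The real transform keeps the original columns and appends the new `features` column, which is exactly the shape the ML KMeans estimator expects.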
Besides, you should also normalize your features using MinMaxScaler to obtain better results.
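The rescaling MinMaxScaler applies per feature is x' = (x - min) / (max - min), mapping each column into [0, 1] by default. A minimal plain-Python sketch of that formula, using the two latitude values from the question (illustrative only; in Spark you would use pyspark.ml.feature.MinMaxScaler on the assembled features column):

```python
def min_max_scale(values):
    """Rescale a list of numbers into [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant column: MinMaxScaler maps it to the middle of the range.
        return [0.5 for _ in values]
    return [(v - lo) / span for v in values]

latitudes = [60.1643075, 60.4686748]
print(min_max_scale(latitudes))  # → [0.0, 1.0]
```

This matters for k-means because the algorithm uses Euclidean distance, so a feature with a larger raw range would dominate the clustering.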
MLlib
In order to achieve this using MLlib, you need to use a map function first to convert all your string values into Double, and merge them together in a DenseVector.
from pyspark.mllib.linalg import Vectors

# On Spark 2.x+ a DataFrame no longer exposes .map directly, so go through .rdd
rdd = df2.rdd.map(lambda data: Vectors.dense([float(c) for c in data]))
After this point you can train your MLlib KMeans model using the rdd variable.