How to convert type Row into Vector to feed to KMeans


Question

When I try to feed df2 to KMeans, I get the following error:

clusters = KMeans.train(df2, 10, maxIterations=30,
                        runs=10, initializationMode="random")

The error I get:

Cannot convert type <class 'pyspark.sql.types.Row'> into Vector

df2 is a DataFrame created as follows:

df = sqlContext.read.json("data/ALS3.json")
df2 = df.select('latitude','longitude')

df2.show()


+----------+----------+
|  latitude| longitude|
+----------+----------+
|60.1643075|24.9460844|
|60.4686748|22.2774728|
+----------+----------+

How can I convert these two columns to a Vector and feed them to KMeans?

Recommended answer

ML

The problem is that you missed the documentation's example, which makes it clear that KMeans requires a DataFrame with a Vector column as features, not a DataFrame of plain Row values. Note that in the ML API the model is fitted with fit; train belongs to the MLlib API, which takes an RDD of vectors.

To modify your current data's structure you can use a VectorAssembler. In your case it could be something like:

from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col

# Combine the two coordinate columns into a single Vector column named "features".
vectorAssembler = VectorAssembler(inputCols=["latitude", "longitude"],
                                  outputCol="features")

# In your case the columns hold strings instead of doubles, so cast them first.
expr = [col(c).cast("double").alias(c)
        for c in vectorAssembler.getInputCols()]

df2 = df2.select(*expr)
df = vectorAssembler.transform(df2)

Besides, you should also normalize your features with MinMaxScaler to obtain better results.
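To make the normalization advice concrete, here is a minimal plain-Python sketch of the min-max rescaling formula, (x - min) / (max - min), applied to the two sample rows from the question. It is only an illustration of the arithmetic, not the Spark API; the helper name min_max_scale and the choice to return 0.5 for a constant column are assumptions made for this sketch.

```python
def min_max_scale(values):
    """Linearly rescale numbers so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread to rescale; this sketch returns 0.5 for it.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# The two sample rows shown in the question, split by column.
latitudes = [60.1643075, 60.4686748]
longitudes = [24.9460844, 22.2774728]

scaled_lat = min_max_scale(latitudes)
scaled_lon = min_max_scale(longitudes)

# Each scaled row is now a pair of values in [0, 1].
scaled_rows = list(zip(scaled_lat, scaled_lon))
print(scaled_rows)
```

After rescaling, both columns span the same range, so neither coordinate dominates the Euclidean distances that KMeans computes.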

MLlib

In order to achieve this using MLlib, you first need to use a map function to convert all your string values into doubles and merge them into a DenseVector.

from pyspark.mllib.linalg import Vectors
rdd = df2.rdd.map(lambda data: Vectors.dense([float(c) for c in data]))

After this point you can train your MLlib KMeans model using the rdd variable.
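The map step above works because a Row is a tuple subclass, so iterating over it yields the column values in order. The following plain-Python sketch shows that per-row conversion without Spark. The namedtuple is a hypothetical stand-in for pyspark.sql.Row, the helper name row_to_floats is made up for illustration, and the values are assumed to be strings, as the answer suggests.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row: a tuple subclass with named fields.
Row = namedtuple("Row", ["latitude", "longitude"])


def row_to_floats(row):
    """Convert every column of a row to float, preserving column order."""
    return [float(c) for c in row]


# The two sample rows from the question, assumed to hold strings.
rows = [
    Row("60.1643075", "24.9460844"),
    Row("60.4686748", "22.2774728"),
]

# Each inner list is what would be passed to Vectors.dense in the Spark version.
features = [row_to_floats(r) for r in rows]
print(features)
```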
