Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector


Question

Given my PySpark Row object:

>>> row
Row(clicked=0, features=SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}))
>>> row.clicked
0
>>> row.features
SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752})
>>> type(row.features)
<class 'pyspark.ml.linalg.SparseVector'>

However, row.features fails the isinstance(row.features, Vector) test:

>>> isinstance(SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}), Vector)
True
>>> isinstance(row.features, Vector)
False
>>> isinstance(deepcopy(row.features), Vector)
False

This strange error has caused me a lot of trouble. Unless isinstance(row.features, Vector) passes, I cannot generate a LabeledPoint using the map function. I would be really grateful if anyone could solve this problem.

Recommended answer

This is unlikely to be a bug. You didn't provide the code needed to reproduce the issue, but most likely you are using Spark 2.0 with ML transformers and are comparing the wrong entities.

Let's illustrate that with an example. First, some simple data:

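# Spark 2.0 API: OneHotEncoder is a plain transformer here.
# `sc` is the SparkContext provided by the PySpark shell.
from pyspark.ml.feature import OneHotEncoder

row = OneHotEncoder(inputCol="x", outputCol="features").transform(
    sc.parallelize([(1.0, )]).toDF(["x"])
).first()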

Now let's import the different vector classes:

from pyspark.ml.linalg import Vector as MLVector, Vectors as MLVectors
from pyspark.mllib.linalg import Vector as MLLibVector, Vectors as MLLibVectors
from pyspark.mllib.regression import LabeledPoint

and run the tests:

isinstance(row.features, MLLibVector)

False

isinstance(row.features, MLVector)

True

As you can see, what we have is a pyspark.ml.linalg.Vector, not a pyspark.mllib.linalg.Vector, and it is not compatible with the old API:

LabeledPoint(0.0, row.features)

TypeError                                 Traceback (most recent call last)
...
TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector
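
The mismatch can be reproduced without Spark. The following is only a minimal sketch: every class and function in it is a hypothetical stand-in (not the real pyspark code), written on the assumption that the old API checks its argument against its own Vector base class. It shows why two vector hierarchies that look alike still fail each other's isinstance checks:

```python
# Hypothetical stand-ins for the two Spark vector hierarchies.
# None of these names are real pyspark classes.

class MLVector:
    """Plays the role of pyspark.ml.linalg.Vector (new API)."""


class MLSparseVector(MLVector):
    """Plays the role of pyspark.ml.linalg.SparseVector."""

    def __init__(self, size, entries):
        self.size = size
        self.entries = dict(entries)


class MLLibVector:
    """Plays the role of pyspark.mllib.linalg.Vector (old API)."""


def old_api_labeled_point(label, features):
    """Hypothetical old-API constructor: accepts only old-API vectors."""
    if not isinstance(features, MLLibVector):
        raise TypeError(
            "Cannot convert type {0} into Vector".format(type(features))
        )
    return (float(label), features)


features = MLSparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752})

# Same concept, unrelated classes: the check depends on which
# hierarchy the object was created from, not on its contents.
is_ml = isinstance(features, MLVector)
is_mllib = isinstance(features, MLLibVector)

try:
    old_api_labeled_point(0.0, features)
    rejected = False
except TypeError:
    rejected = True
```

Here is_ml is True while is_mllib is False, so the stand-in for the old API rejects the vector, which mirrors the TypeError above.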

You could convert the ML object to an MLLib one:

from pyspark.ml import linalg as ml_linalg

def as_mllib(v):
    if isinstance(v, ml_linalg.SparseVector):
        return MLLibVectors.sparse(v.size, v.indices, v.values)
    elif isinstance(v, ml_linalg.DenseVector):
        return MLLibVectors.dense(v.toArray())
    else:
        raise TypeError("Unsupported type: {0}".format(type(v)))

LabeledPoint(0, as_mllib(row.features))

LabeledPoint(0.0, (1,[],[]))

or simply:

LabeledPoint(0, MLLibVectors.fromML(row.features))

LabeledPoint(0.0, (1,[],[]))

but generally speaking, you should avoid situations where this conversion is necessary.
