Creating Spark dataframe from numpy matrix


Question

It is my first time with PySpark (Spark 2), and I'm trying to create a toy dataframe for a Logit model. I ran the tutorial successfully and would like to pass my own data into it.

This is what I have tried:

%pyspark
import numpy as np
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.mllib.regression import LabeledPoint

# intended: a binary label column plus three gaussian feature columns
df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
# wrap each row as an MLlib LabeledPoint holding an ML DenseVector
df = map(lambda x: LabeledPoint(x[0], Vectors.dense(x[1:])), df)

mydf = spark.createDataFrame(df, ["label", "features"])

But I can't get rid of:

TypeError: Cannot convert type <class 'pyspark.ml.linalg.DenseVector'> into Vector

I'm using the ML library for vectors, and the input is a double array, so what's the catch? It should be fine according to the documentation.

Many thanks.

Solution

You are mixing functionality from ML and MLlib, which are not necessarily compatible. You don't need a LabeledPoint when using spark-ml:

sc.version
# u'2.1.1'

import numpy as np
from pyspark.ml.linalg import Vectors

# same toy matrix as in the question
df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
# plain (label, DenseVector) tuples instead of LabeledPoint
dff = map(lambda x: (int(x[0]), Vectors.dense(x[1:])), df)

mydf = spark.createDataFrame(dff, schema=["label", "features"])

mydf.show(5)
# +-----+-------------+ 
# |label|     features| 
# +-----+-------------+ 
# |    1|[0.0,0.0,0.0]| 
# |    0|[0.0,1.0,1.0]| 
# |    0|[0.0,1.0,0.0]| 
# |    1|[0.0,0.0,1.0]| 
# |    0|[0.0,1.0,0.0]|
# +-----+-------------+
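
Since the goal was a Logit model, here is a minimal sketch of fitting spark.ml's LogisticRegression on the resulting dataframe; the hyperparameter values are illustrative, not from the original answer:

from pyspark.ml.classification import LogisticRegression

# maxIter/regParam values are illustrative only
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(mydf)  # expects the 'label' and 'features' columns built above

print(model.coefficients)  # weights for the three features
print(model.intercept)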

PS: As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package. [ref.]
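
For completeness, the features column can also be assembled with spark.ml's VectorAssembler instead of mapping rows to Vectors.dense by hand. A sketch, assuming the numpy matrix df from the answer above; the per-feature column names x1..x3 are illustrative, and the explicit int()/float() casts are there because older PySpark versions reject numpy scalar types:

from pyspark.ml.feature import VectorAssembler

# one plain column per feature; x1..x3 are illustrative names
raw = spark.createDataFrame(
    [(int(r[0]), float(r[1]), float(r[2]), float(r[3])) for r in df],
    schema=["label", "x1", "x2", "x3"])

assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
mydf2 = assembler.transform(raw).select("label", "features")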
