Creating Spark dataframe from numpy matrix
Question
This is my first time using PySpark (Spark 2), and I'm trying to create a toy dataframe for a Logit model. I ran the tutorial successfully and would now like to pass my own data into it.
I've tried this:
%pyspark
import numpy as np
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.mllib.regression import LabeledPoint
df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
df = map(lambda x: LabeledPoint(x[0], Vectors.dense(x[1:])), df)
mydf = spark.createDataFrame(df,["label", "features"])
But I cannot get rid of this error:
TypeError: Cannot convert type <class 'pyspark.ml.linalg.DenseVector'> into Vector
I'm using the ML library for the vectors and the input is an array of doubles, so what's the catch? According to the documentation, it should be fine.
Many thanks.
Recommended answer
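You are mixing functionality from ML and MLlib, which are not necessarily compatible. The LabeledPoint in the question comes from pyspark.mllib.regression and expects an MLlib vector, while Vectors.dense from pyspark.ml.linalg produces an ML DenseVector, hence the TypeError. You don't need a LabeledPoint when using spark-ml: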
sc.version
# u'2.1.1'
import numpy as np
from pyspark.ml.linalg import Vectors
df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
dff = map(lambda x: (int(x[0]), Vectors.dense(x[1:])), df)
mydf = spark.createDataFrame(dff,schema=["label", "features"])
mydf.show(5)
# +-----+-------------+
# |label|     features|
# +-----+-------------+
# |    1|[0.0,0.0,0.0]|
# |    0|[0.0,1.0,1.0]|
# |    0|[0.0,1.0,0.0]|
# |    1|[0.0,0.0,1.0]|
# |    0|[0.0,1.0,0.0]|
# +-----+-------------+
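One thing worth checking in the snippet above: np.concatenate joins the four 1000-element arrays end to end into a single 4000-element array, so reshape(1000, -1) cuts it into consecutive groups of four instead of placing each array in its own column. That would explain why the first rows of the output show only 0.0 and 1.0 in features: they all come from the randint block. Below is a minimal sketch of the difference, assuming the intent was one label column plus three feature columns. np.column_stack is a suggested alternative here, not part of the original answer, and the fixed seed is only there to make the sketch repeatable.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed, only so the sketch is repeatable
n = 1000

label = rng.randint(0, 2, size=n)  # 0/1 labels
f1 = rng.randn(n)                  # standard normal feature
f2 = 3 * rng.randn(n) + 2          # shifted and scaled feature
f3 = 6 * rng.randn(n) - 2          # shifted and scaled feature

# Original approach: arrays are laid end to end, then cut into rows of 4.
flat = np.concatenate([label, f1, f2, f3]).reshape(n, -1)

# Alternative: each array becomes one column of an (n, 4) matrix.
stacked = np.column_stack([label, f1, f2, f3])

# In `flat`, the first n/4 rows hold nothing but label values (0.0 or 1.0).
print(np.isin(flat[: n // 4], [0.0, 1.0]).all())

# In `stacked`, column 0 is the label and columns 1..3 are the features.
print(np.array_equal(stacked[:, 0], label), np.array_equal(stacked[:, 1], f1))
```

With the column-stacked matrix, the same map(lambda x: (int(x[0]), Vectors.dense(x[1:])), ...) step would pair each label with its own three feature values.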
PS: As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package. [ref.]
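To illustrate the DataFrame-based route end to end, here is a hedged sketch that turns a numpy matrix into (label, features) rows and fits a logistic regression with spark.ml. The helper names matrix_to_rows and fit_logit are hypothetical and do not come from the original answer. The sketch assumes that column 0 of the matrix holds the 0/1 label and that an active SparkSession is passed in as spark.

```python
import numpy as np


def matrix_to_rows(matrix):
    """Split an (n, k) matrix into (label, feature_list) tuples.

    Hypothetical helper for illustration; assumes column 0 is the label.
    """
    data = np.asarray(matrix, dtype=float)
    return [(int(row[0]), [float(v) for v in row[1:]]) for row in data]


def fit_logit(spark, matrix):
    """Build a spark.ml DataFrame from the matrix and fit a logistic regression."""
    # Imported here so the row-building helper above works without Spark installed.
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    rows = [(label, Vectors.dense(features))
            for label, features in matrix_to_rows(matrix)]
    df = spark.createDataFrame(rows, schema=["label", "features"])
    # LogisticRegression reads the "features" and "label" columns by default.
    return LogisticRegression(maxIter=10).fit(df)


# Usage, assuming a running session named `spark` and an (n, 4) matrix `data`:
#   model = fit_logit(spark, data)
#   print(model.coefficients)
```

Keeping the row construction separate from the Spark calls makes it easy to inspect the tuples before handing them to createDataFrame.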