Convert a Dense Vector to a Dataframe using Pyspark
Problem Description
Firstly, I tried everything in the link below to fix my error, but none of it worked.
How to convert RDD of dense vector into DataFrame in pyspark?
I am trying to convert a dense vector into a dataframe (preferably Spark) along with column names, and I am running into issues.
My column in the Spark dataframe is a vector that was created using VectorAssembler, and I now want to convert it back to a dataframe, as I would like to create plots on some of the variables in the vector.
Approach 1:
from pyspark.ml.linalg import SparseVector, DenseVector
from pyspark.ml.linalg import Vectors
temp=output.select("all_features")
temp.rdd.map(
lambda row: (DenseVector(row[0].toArray()))
).toDF()
Below is the error:
TypeError: not supported type: <type 'numpy.ndarray'>
Approach 2:
from pyspark.ml.linalg import VectorUDT
from pyspark.sql.functions import udf
from pyspark.ml.linalg import *
as_ml = udf(lambda v: v.asML() if v is not None else None, VectorUDT())
result = output.withColumn("all_features", as_ml("all_features"))
result.head(5)
Error:
AttributeError: 'numpy.ndarray' object has no attribute 'asML'
I also tried converting the dataframe into a Pandas dataframe, but after that I am not able to split the values into separate columns.
Approach 3:
import pandas as pd

pandas_df = temp.toPandas()
pandas_df1 = pd.DataFrame(pandas_df.all_features.values.tolist())
The above code runs fine, but I still have only one column in my dataframe, with all the values separated by commas as a list.
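If each vector is first turned into a plain list, pandas will spread the values into separate columns. A minimal sketch with a hypothetical 3-feature frame standing in for the 200-column one (numpy arrays play the role of the collected vector cells, since both expose tolist()):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for temp.toPandas(): each cell holds the
# feature values of one row of the vector column.
pandas_df = pd.DataFrame({
    "all_features": [np.array([0.0119, 0.0, 0.5049]),
                     np.array([0.0477, 0.0, 0.1657])]
})

# Convert each vector to a list, then let the DataFrame constructor
# spread the lists into one column per feature.
split_df = pd.DataFrame(
    pandas_df["all_features"].apply(lambda v: v.tolist()).tolist(),
    columns=["col1", "col2", "col3"],
)
```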
Any help is much appreciated.
EDIT:
Here is how my temp dataframe looks. It has just one column, all_features. I am trying to create a dataframe that splits all of these values into separate columns (all_features is a vector that was created using 200 columns).
+--------------------+
| all_features|
+--------------------+
|[0.01193689934723...|
|[0.04774759738895...|
|[0.0,0.0,0.194417...|
|[0.02387379869447...|
|[1.89796699621085...|
+--------------------+
only showing top 5 rows
The expected output is a dataframe with all 200 columns separated out:
+----------------------------+
| col1| col2| col3|...
+----------------------------+
|0.01193689934723|0.0|0.5049431301173817...
|0.04774759738895|0.0|0.1657316216149636...
|0.0|0.0|7.213126372469...
|0.02387379869447|0.0|0.1866693496827619|...
|1.89796699621085|0.0|0.3192169213385746|...
+----------------------------+
only showing top 5 rows
Here is how my Pandas DF output looks:
0
0 [0.011936899347238104, 0.0, 0.5049431301173817...
1 [0.047747597388952415, 0.0, 0.1657316216149636...
2 [0.0, 0.0, 0.19441761495525278, 7.213126372469...
3 [0.023873798694476207, 0.0, 0.1866693496827619...
4 [1.8979669962108585, 0.0, 0.3192169213385746, ...
Answer
Since you want all the features in separate columns (as I gathered from your EDIT), the link to the answer you provided is not your solution.
Try this,
# column_names is the list of the 200 feature names that went into the VectorAssembler
temp = temp.rdd.map(lambda x: [float(y) for y in x['all_features']]).toDF(column_names)
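The lambda itself just flattens one vector row into plain floats, which is what lets toDF build ordinary columns. The conversion step can be checked with plain Python (a list stands in for the DenseVector here, since both iterate over their values):

```python
# Hypothetical single row of temp; a plain list plays the DenseVector.
x = {"all_features": [0.0119, 0.0, 0.5049]}

# Same conversion the rdd.map above applies to every row.
flat = [float(y) for y in x["all_features"]]
```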
EDIT:

Since your temp is originally a dataframe, you can also use this method without converting it to an rdd:
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType

# i=i binds each column index at definition time; without it, every udf
# would close over the same loop variable and read index 199
splits = [F.udf(lambda val, i=i: float(val[i].item()), FloatType()) for i in range(200)]
temp = temp.select(*[s(F.col('all_features')).alias(c) for c, s in zip(column_names, splits)])
temp.show()