Create labeledPoints from Spark DataFrame in Python


Question

What .map() function in Python do I use to create a set of labeledPoints from a Spark DataFrame? What is the notation if the label/outcome is not the first column, but I can refer to it by its column name, 'status'?

I create the Python dataframe with this .map() function:

import pandas as pd

def parsePoint(line):
    # Tokenize the tab-delimited line; the first token is skipped below.
    listmp = list(line.split('\t'))
    # One-hot encode the remaining tokens and collapse to a single-row frame.
    dataframe = pd.DataFrame(pd.get_dummies(listmp[1:]).sum()).transpose()
    # Use the 'accepted' indicator as the outcome column (assumes it exists).
    dataframe.insert(0, 'status', dataframe['accepted'])
    # Drop placeholder and outcome columns so only features remain.
    for col in ('NULL', '', 'rejected', 'accepted'):
        if col in dataframe.columns:
            dataframe = dataframe.drop(col, axis=1)
    return dataframe
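As a quick local sanity check (a sketch assuming each input line is tab-delimited, with an id-like first token and an 'accepted' outcome among the rest; the sample line and its feature tokens are made up), the parser can be exercised with pandas alone:

```python
import pandas as pd

def parsePoint(line):  # mirrors parsePoint above
    listmp = list(line.split('\t'))
    dataframe = pd.DataFrame(pd.get_dummies(listmp[1:]).sum()).transpose()
    dataframe.insert(0, 'status', dataframe['accepted'])
    for col in ('NULL', '', 'rejected', 'accepted'):
        if col in dataframe.columns:
            dataframe = dataframe.drop(col, axis=1)
    return dataframe

# Hypothetical line: id token, outcome, then two feature tokens.
df = parsePoint("id42\taccepted\tfeatA\tfeatB")
print(list(df.columns))           # ['status', 'featA', 'featB']
print(int(df['status'].iloc[0]))  # 1
```

Note that the one-row result has the outcome in 'status' and one dummy column per remaining token, which is what the reduce step later stacks into a full frame.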

I convert it to a Spark dataframe after the reduce function has recombined all the pandas dataframes:

parsedData = sqlContext.createDataFrame(parsedData)

But now how do I create labeledPoints from this in Python? I assume it may be another .map() function?

Answer

If you already have numerical features that require no additional transformations, you can use VectorAssembler to combine the columns containing the independent variables:

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["your", "independent", "variables"],
    outputCol="features")

transformed = assembler.transform(parsedData)
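For intuition only: VectorAssembler does nothing more than concatenate the listed columns, row by row, into a single vector-valued column. A plain-Python sketch of the same operation (not the pyspark implementation; the column names are the placeholders from above):

```python
def assemble(row, input_cols, output_col="features"):
    # Concatenate the named numeric columns of a dict-like row into one
    # list-valued column, mirroring what VectorAssembler.transform does per row.
    out = dict(row)
    out[output_col] = [float(row[c]) for c in input_cols]
    return out

row = {"your": 1.0, "independent": 2.0, "variables": 3.0, "status": 1}
print(assemble(row, ["your", "independent", "variables"])["features"])  # [1.0, 2.0, 3.0]
```

The real transformer additionally validates types and emits a Spark Vector rather than a list, but the column-gathering logic is the same idea.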

Next you can simply map:

from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col

(transformed.select(col("outcome_column").alias("label"), col("features"))
  .rdd
  .map(lambda row: LabeledPoint(row.label, row.features)))

As of Spark 2.0, the ml and mllib APIs are no longer compatible, and the latter is headed for deprecation and removal. If you still need this, you'll have to convert ml.Vectors to mllib.Vectors:

from pyspark.mllib import linalg as mllib_linalg
from pyspark.ml import linalg as ml_linalg

def as_old(v):
    if isinstance(v, ml_linalg.SparseVector):
        return mllib_linalg.SparseVector(v.size, v.indices, v.values)
    if isinstance(v, ml_linalg.DenseVector):
        return mllib_linalg.DenseVector(v.values)
    raise ValueError("Unsupported type {0}".format(type(v)))

and map:

(transformed.select(col("outcome_column").alias("label"), col("features"))
  .rdd
  .map(lambda row: LabeledPoint(row.label, as_old(row.features))))
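For intuition about what the converter preserves: an ml SparseVector is essentially a (size, indices, values) triple, and the mllib constructor consumes that same triple. A plain-Python sketch of the sparse-to-dense expansion such a vector represents (illustration only, not pyspark code):

```python
def sparse_to_dense(size, indices, values):
    # Expand a (size, indices, values) sparse triple into a dense list,
    # mirroring what SparseVector.toArray() would return.
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

print(sparse_to_dense(4, [0, 3], [1.5, -2.0]))  # [1.5, 0.0, 0.0, -2.0]
```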
