Create LabeledPoints from a Spark DataFrame in Python

Question

What .map() function in Python do I use to create a set of LabeledPoints from a Spark DataFrame? What is the notation if the label/outcome is not the first column, but I can refer to it by its column name, 'status'?

I create the pandas DataFrame with this .map() function:

import pandas as pd

def parsePoint(line):
    # Split the tab-separated record; the first field is skipped below.
    listmp = line.split('\t')
    # One-hot encode the remaining fields and collapse them into a single row of counts.
    dataframe = pd.DataFrame(pd.get_dummies(listmp[1:]).sum()).transpose()
    # Put the label first; this raises KeyError if 'accepted' is not among the fields.
    dataframe.insert(0, 'status', dataframe['accepted'])
    # Drop placeholder and outcome columns so they are not used as features.
    for name in ('NULL', '', 'rejected', 'accepted'):
        if name in dataframe.columns:
            dataframe = dataframe.drop(name, axis=1)
    return dataframe

I convert it to a Spark DataFrame after the reduce function has recombined all the pandas DataFrames.

parsedData = sqlContext.createDataFrame(parsedData)

But now how do I create LabeledPoints from this in Python? I assume it may be another .map() function?

Recommended answer

If you already have numerical features that require no additional transformations, you can use VectorAssembler to combine the columns containing the independent variables:

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["your", "independent", "variables"],
    outputCol="features")

transformed = assembler.transform(parsedData)
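
The inputCols list above is a placeholder. Since the question says the label can be referred to by name ('status') rather than by position, one way to build the list is to take every column name except the label. The following is only a sketch: the helper feature_columns and the sample column names are hypothetical, and in practice the list would presumably come from parsedData.columns.

```python
def feature_columns(all_columns, label_col="status"):
    """Return every column name except the label, preserving order."""
    if label_col not in all_columns:
        raise ValueError("label column %r not found" % label_col)
    return [c for c in all_columns if c != label_col]


# Hypothetical column names standing in for parsedData.columns.
columns = ["status", "red", "blue", "large"]
input_cols = feature_columns(columns)
print(input_cols)
```

The resulting list could then be passed as inputCols to VectorAssembler, with 'status' taking the place of "outcome_column" when the label is selected.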

Next, you can simply map:

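from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col

# Go through .rdd: DataFrame.map is only available on Spark 1.x, while .rdd.map works on 1.x and later.
# On Spark 2.0+ the assembled vectors are pyspark.ml vectors and may first need converting
# with pyspark.mllib.linalg.Vectors.fromML before LabeledPoint accepts them.
(transformed.select(col("outcome_column").alias("label"), col("features"))
  .rdd
  .map(lambda row: LabeledPoint(row.label, row.features)))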
