Create LabeledPoints from a Spark DataFrame in Python
Problem description
What .map() function in Python do I use to create a set of LabeledPoints from a Spark DataFrame? What is the notation if the label/outcome is not the first column, but I can refer to it by its column name, 'status'?
I create the pandas DataFrame with this function, which I apply through .map():
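import pandas as pd

def parsePoint(line):
    # Split the tab-separated line; the first field is skipped below.
    listmp = list(line.split('\t'))
    # One-hot encode the remaining values and sum them into a single-row frame of counts.
    dataframe = pd.DataFrame(pd.get_dummies(listmp[1:]).sum()).transpose()
    # Copy the 'accepted' count into a leading 'status' column (the label).
    dataframe.insert(0, 'status', dataframe['accepted'])
    # Drop placeholder and outcome columns so that only features remain next to 'status'.
    if 'NULL' in dataframe.columns:
        dataframe = dataframe.drop('NULL', axis=1)
    if '' in dataframe.columns:
        dataframe = dataframe.drop('', axis=1)
    if 'rejected' in dataframe.columns:
        dataframe = dataframe.drop('rejected', axis=1)
    if 'accepted' in dataframe.columns:
        dataframe = dataframe.drop('accepted', axis=1)
    return dataframe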
I convert it to a Spark DataFrame after the reduce function has recombined all the pandas DataFrames.
parsedData=sqlContext.createDataFrame(parsedData)
But now how do I create LabeledPoints from this in Python? I assume it may be another .map() function?
Recommended answer
If you already have numerical features that require no additional transformations, you can use VectorAssembler to combine the columns containing the independent variables:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["your", "independent", "variables"],
    outputCol="features")

transformed = assembler.transform(parsedData)
Next, you can simply map over the rows:
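from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col

# On Spark 2.0+, DataFrame has no .map(), so go through .rdd (this also works on 1.x).
# If LabeledPoint rejects the assembled vector there, convert it first with
# pyspark.mllib.linalg.Vectors.fromML(row.features).
(transformed.select(col("outcome_column").alias("label"), col("features"))
    .rdd
    .map(lambda row: LabeledPoint(row.label, row.features)))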
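To make the column-name lookup concrete, here is a minimal plain-Python sketch of the same idea that runs without Spark. It is only an illustration under stated assumptions: the `LabeledPoint` below is a namedtuple stand-in rather than the pyspark class, `to_labeled_point` is a hypothetical helper, and the sample rows and the feature names `red` and `large` are invented. Only the label name `status` comes from the question.

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.regression.LabeledPoint so the sketch runs without Spark.
LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])


def to_labeled_point(row, label_col, feature_cols):
    """Build a labeled point from a dict-like row, looking the label up by name."""
    label = float(row[label_col])
    features = [float(row[c]) for c in feature_cols]
    return LabeledPoint(label, features)


# Hypothetical rows; 'status' is the label and is deliberately not the first column.
columns = ["red", "status", "large"]
rows = [
    {"red": 1, "status": 1, "large": 0},
    {"red": 0, "status": 0, "large": 1},
]

# Every column except the label becomes a feature, in the original column order.
feature_cols = [c for c in columns if c != "status"]

points = [to_labeled_point(r, "status", feature_cols) for r in rows]
```

With Spark, the same `feature_cols` list would be passed as `inputCols` to the VectorAssembler, and `"status"` would take the place of `"outcome_column"` in the select, so the position of the label column does not matter.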