将数据框拟合到randomForest pyspark [英] Fit a dataframe into randomForest pyspark

查看：69 发布时间：2020/9/4 2:34:40 python apache-spark pyspark apache-spark-ml

本文介绍了将数据框拟合到randomForest pyspark的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个看起来像这样的DataFrame:

I have a DataFrame that looks like this:

+--------------------+------------------+
|            features|           labels |
+--------------------+------------------+
|[-0.38475, 0.568...]|          label1  |
|[0.645734, 0.699...]|          label2  |
|     .....          |          ...     |
+--------------------+------------------+

这两列都是String类型(StringType())，我想将其放入spark ml randomForest中.为此，我需要将features列转换为包含浮点数的向量.有谁知道怎么做吗?

Both columns are of String type (StringType()), I would like to fit this into spark ml randomForest. To do so, I need to convert the features columns into a vector containing floats. Does any one have any idea How to do so ?

推荐答案

如果您使用的是 Spark 2.x ，我相信这就是您所需要的:

If you are using Spark 2.x, I believe that this is what you need :

from pyspark.sql.functions import udf
from pyspark.mllib.linalg import Vectors
from pyspark.ml.linalg import VectorUDT
from pyspark.ml.feature import StringIndexer

df = spark.createDataFrame([("[-0.38475, 0.568]", "label1"), ("[0.645734, 0.699]", "label2")], ("features", "label"))

def parse(s):
  try:
    return Vectors.parse(s).asML()
  except:
    return None

parse_ = udf(parse, VectorUDT())

parsed = df.withColumn("features", parse_("features"))

indexer = StringIndexer(inputCol="label", outputCol="label_indexed")

indexer.fit(parsed).transform(parsed).show()
## +----------------+------+-------------+
## |        features| label|label_indexed|
## +----------------+------+-------------+
## |[-0.38475,0.568]|label1|          0.0|
## |[0.645734,0.699]|label2|          1.0|
## +----------------+------+-------------+

使用 Spark 1.6 ，并没有太大不同:

With Spark 1.6, it isn't much different :

from pyspark.sql.functions import udf
from pyspark.ml.feature import StringIndexer
from pyspark.mllib.linalg import Vectors, VectorUDT

df = sqlContext.createDataFrame([("[-0.38475, 0.568]", "label1"), ("[0.645734, 0.699]", "label2")], ("features", "label"))

parse_ = udf(Vectors.parse, VectorUDT())

parsed = df.withColumn("features", parse_("features"))

indexer = StringIndexer(inputCol="label", outputCol="label_indexed")

indexer.fit(parsed).transform(parsed).show()
## +----------------+------+-------------+
## |        features| label|label_indexed|
## +----------------+------+-------------+
## |[-0.38475,0.568]|label1|          0.0|
## |[0.645734,0.699]|label2|          1.0|
## +----------------+------+-------------+

Vectors具有parse功能，可以帮助您实现要执行的操作.

Vectors has a parse function that can help you achieve what you are trying to do.

这篇关于将数据框拟合到randomForest pyspark的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

将数据框拟合到randomForest pyspark [英] Fit a dataframe into randomForest pyspark

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

将数据框拟合到randomForest pyspark [英] Fit a dataframe into randomForest pyspark

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭