How to use string variables in VectorAssembler in Pyspark

Views: 105 Published: 2021/6/24 20:35:18 Tags: pyspark random-forest

Problem description

I want to run the Random Forests algorithm in Pyspark. The Pyspark documentation mentions that VectorAssembler accepts only numerical or boolean datatypes. So, if my data contains StringType variables, say names of cities, should I one-hot encode them before proceeding with Random Forests classification/regression?

Here is the code I have been trying; the input file is here:

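from pyspark.ml.feature import VectorAssembler

train = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('filename')
drop_list = ["Country", "Carrier", "TrafficType", "Device", "Browser", "OS", "Fraud", "ConversionPayOut"]
# Only this variable is actually double; the rest of them are strings
train = train.withColumn("ConversionPayOut", train["ConversionPayOut"].cast("double"))
# Despite its name, drop_list holds the columns that are kept
junk = train.select([column for column in train.columns if column in drop_list])
# The post does not show how assembler was created; this definition is an assumption
assembler = VectorAssembler(inputCols=junk.columns, outputCol="features")
transformed = assembler.transform(junk)  # this call raises the IllegalArgumentException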

I keep getting IllegalArgumentException: u'Data type StringType is not supported.'
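
The error reflects that the assembler can only pack numbers into a vector, so a string column has to be turned into numbers first. As a rough illustration of what that encoding means, the plain-Python sketch below maps each distinct city name to an integer index and then to a one-hot vector. It is not Spark code: the helper names `build_index` and `one_hot` and the sample values are hypothetical, and the most-frequent-first ordering is an assumption chosen to resemble the default ordering of Spark's StringIndexer.

```python
from collections import Counter


def build_index(values):
    """Map each distinct string to an integer index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency, breaking ties alphabetically
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: i for i, value in enumerate(ordered)}


def one_hot(index, size):
    """Return a list of length `size` with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


cities = ["Paris", "Delhi", "Paris", "Lima"]  # hypothetical sample values
mapping = build_index(cities)                  # string -> integer index
indexed = [mapping[c] for c in cities]         # index-encoded column
encoded = [one_hot(i, len(mapping)) for i in indexed]  # one-hot encoded column
```

Either form is numeric, which is the property the assembler needs; whether the extra one-hot step is worthwhile depends on the model that consumes the vector.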

P.S.: Apologies for asking a basic question. I come from an R background. In R, when we run Random Forests, there is no need to convert categorical variables into numeric variables.
