How to use string variables in VectorAssembler in PySpark

Question

I want to run the random forest algorithm in PySpark. The PySpark documentation mentions that VectorAssembler accepts only numeric or boolean data types. So, if my data contains StringType variables, say names of cities, should I one-hot encode them before proceeding with random forest classification/regression?

Here is the code I have been trying, input file is here:

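from pyspark.ml.feature import VectorAssembler

train = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('filename')
# Despite its name, this list holds the columns that are kept by the select below.
drop_list = ["Country", "Carrier", "TrafficType", "Device", "Browser", "OS", "Fraud", "ConversionPayOut"]
# Only ConversionPayOut is actually a double; the rest of the columns are strings.
train = train.withColumn("ConversionPayOut", train["ConversionPayOut"].cast("double"))
junk = train.select([column for column in train.columns if column in drop_list])
# The assembler definition was not shown in the post; the next line is an assumed reconstruction.
assembler = VectorAssembler(inputCols=junk.columns, outputCol="features")
transformed = assembler.transform(junk)  # fails here because of the string columns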

I keep getting IllegalArgumentException: u'Data type StringType is not supported.'

P.S.: Apologies for asking a basic question. I come from an R background. In R, when we fit random forests, there is no need to convert categorical variables into numeric variables.

Recommended answer

Yes, you should use StringIndexer, possibly together with OneHotEncoder. You can find more information on both in the linked documentation.
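
As a rough illustration of that advice, the sketch below chains StringIndexer, an optional OneHotEncoder, and VectorAssembler in a Pipeline. It is not taken from the original answer. The helper names (string_columns, feature_columns, build_pipeline) and the "_idx" / "_vec" suffixes are made up for this example, and the split into categorical and numeric columns is an assumption about the asker's data.

```python
def string_columns(dtypes, exclude=()):
    """Pick the string-typed columns out of DataFrame.dtypes, a list of (name, type) pairs."""
    return [name for name, dtype in dtypes if dtype == "string" and name not in exclude]


def feature_columns(categorical_cols, numeric_cols, one_hot=True):
    """Names of the columns handed to VectorAssembler (suffixes are hypothetical)."""
    suffix = "_vec" if one_hot else "_idx"
    return [c + suffix for c in categorical_cols] + list(numeric_cols)


def build_pipeline(categorical_cols, numeric_cols, one_hot=True):
    """Assemble the stages; call this only after a SparkSession/SparkContext exists."""
    # Imported lazily so the two helpers above work without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    # Step 1: map each string column to a numeric category index.
    stages = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in categorical_cols]
    # Step 2 (optional): expand each index into a sparse one-hot vector.
    if one_hot:
        stages += [OneHotEncoder(inputCol=c + "_idx", outputCol=c + "_vec")
                   for c in categorical_cols]
    # Step 3: combine the encoded columns and the numeric columns into one vector.
    stages.append(VectorAssembler(
        inputCols=feature_columns(categorical_cols, numeric_cols, one_hot),
        outputCol="features"))
    return Pipeline(stages=stages)


# Possible usage with the asker's DataFrame. "Fraud" is left out on the
# assumption that it is the label rather than a feature:
#   categorical = string_columns(train.dtypes, exclude=["Fraud"])
#   pipeline = build_pipeline(categorical, ["ConversionPayOut"])
#   transformed = pipeline.fit(train).transform(train)
```

The one_hot flag reflects the "possibly" in the answer: Spark's tree-based models can treat StringIndexer output as categorical, so the encoder step may be unnecessary for random forests, provided maxBins is at least as large as the number of distinct categories in any column.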
