PySpark: KeyError when converting a DataFrame column of String type to Double
Question
I'm trying to learn machine learning with PySpark. I have a dataset that has a couple of String columns whose values are either True or False or Yes or No. I'm working with DecisionTree, and I wanted to convert these String values to the corresponding Double values, i.e. True and Yes should change to 1.0, and False and No should change to 0.0. I saw a tutorial where they did the same thing, and I came up with this code:
df = sqlContext.read.csv("C:/../churn-bigml-20.csv",inferSchema=True,header=True)
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import UserDefinedFunction
binary_map = {'Yes':1.0, 'No':0.0, 'True':1.0, 'False':0.0}
toNum = UserDefinedFunction(lambda k: binary_map[k], DoubleType())
csv_data = df.drop('State').drop('Area code') \
.withColumn('Churn', toNum(df['Churn'])) \
.withColumn('International plan', toNum(df['International plan'])) \
.withColumn('Voice mail plan', toNum(df['Voice mail plan'])).cache()
However, when I run this, I get many errors that look like this:
File "C:\..\spark-2.1.0\python\lib\pyspark.zip\pyspark\worker.py", line 70, in <lambda>
File "C:\..\workspace\PyML\src\ModelBuilding.py", line 20, in <lambda>
toNum = UserDefinedFunction(lambda k: binary_map[k], DoubleType())
KeyError: False
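The traceback shows the key that failed the lookup is the Python boolean False, not the string 'False', which means Spark is handing the UDF a boolean value while the dictionary only has string keys. A minimal plain-Python reproduction of the lookup, with no Spark involved:

```python
# The question's mapping: all keys are strings.
binary_map = {'Yes': 1.0, 'No': 0.0, 'True': 1.0, 'False': 0.0}

# Spark passes the boolean False for a boolean column,
# and the string key 'False' does not match the boolean key False.
try:
    binary_map[False]
except KeyError as e:
    print("KeyError:", e)  # KeyError: False

print('False' in binary_map)  # True  (string key is present)
print(False in binary_map)    # False (boolean key is not)
```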
Note: I'm working on PySpark with Spark 2.1 and Python 3.5, and I think the tutorial I followed used Spark 1.6 and Python 2.7, so I don't know whether this is one of the Python syntax issues.
Answer
I solved it by changing the mapping part to:
binary_map = {'Yes':1.0, 'No':0.0, True : 1.0, False : 0.0}
toNum = UserDefinedFunction(lambda k: binary_map[k], DoubleType())
I just removed the quotes from True and False. I thought that was weird, but when I checked the schema of the DataFrame using print(df.printSchema()), it showed that the field that has True and False values is of type boolean.
Schema:
root
|-- State: string (nullable = true)
|-- Account length: integer (nullable = true)
|-- Area code: integer (nullable = true)
|-- International plan: string (nullable = true)
|-- Voice mail plan: string (nullable = true)
.
.
.
|-- Customer service calls: integer (nullable = true)
|-- Churn: boolean (nullable = true)
So that's why I had to take the quotes off. Thank you.
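Since Python dictionaries treat the boolean key True and the string key 'True' as distinct keys (they hash differently), a mapping that lists both kinds of keys would let a single UDF cover the boolean Churn column as well as the 'Yes'/'No' string columns. A minimal sketch of just the lookup logic, without Spark:

```python
# Sketch (plain Python, no Spark): one mapping with both string and
# boolean keys, so the same lookup works for every column involved.
binary_map = {
    'Yes': 1.0, 'No': 0.0,   # string columns: International plan, Voice mail plan
    True: 1.0, False: 0.0,   # boolean column: Churn
}

to_num = lambda k: binary_map[k]  # the function the UDF would wrap

print(to_num('Yes'), to_num(False))  # 1.0 0.0
```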