PySpark: KeyError when converting a DataFrame column of String type to Double
Question
I'm trying to learn machine learning with PySpark. I have a dataset that has a couple of String columns whose values are either True or False or Yes or No. I'm working with DecisionTree, and I wanted to convert these String values to the corresponding Double values, i.e. True and Yes should change to 1.0, and False and No should change to 0.0. I saw a tutorial where they did the same thing, and I came up with this code:
df = sqlContext.read.csv("C:/../churn-bigml-20.csv",inferSchema=True,header=True)
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import UserDefinedFunction
binary_map = {'Yes':1.0, 'No':0.0, 'True':1.0, 'False':0.0}
toNum = UserDefinedFunction(lambda k: binary_map[k], DoubleType())
csv_data = df.drop('State').drop('Area code') \
.withColumn('Churn', toNum(df['Churn'])) \
.withColumn('International plan', toNum(df['International plan'])) \
.withColumn('Voice mail plan', toNum(df['Voice mail plan'])).cache()
However, when I run this, I get many errors that look like this:
File "C:\..\spark-2.1.0\python\lib\pyspark.zip\pyspark\worker.py", line 70, in <lambda>
File "C:\..\workspace\PyML\src\ModelBuilding.py", line 20, in <lambda>
toNum = UserDefinedFunction(lambda k: binary_map[k], DoubleType())
KeyError: False
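The traceback shows the key that failed the lookup is the Python boolean False, not the string 'False', which means Spark is handing the UDF a boolean value while the dictionary only has string keys. A minimal plain-Python reproduction of the lookup, with no Spark involved:

```python
# The question's mapping: all keys are strings.
binary_map = {'Yes': 1.0, 'No': 0.0, 'True': 1.0, 'False': 0.0}

# Spark passes the boolean False for a boolean column,
# and the string key 'False' does not match the boolean key False.
try:
    binary_map[False]
except KeyError as e:
    print("KeyError:", e)  # KeyError: False

print('False' in binary_map)  # True  (string key is present)
print(False in binary_map)    # False (boolean key is not)
```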
Note: I'm working on PySpark with Spark 2.1 and Python 3.5, and I think the tutorial I followed used Spark 1.6 and Python 2.7, so I don't know whether this is one of the Python syntax issues.
Answer
I solved it by changing the mapping part to:
binary_map = {'Yes':1.0, 'No':0.0, True : 1.0, False : 0.0}
toNum = UserDefinedFunction(lambda k: binary_map[k], DoubleType())
I just removed the quotes from True and False. I thought that was weird, but when I checked the schema of the DataFrame using print(df.printSchema()), it showed that the field that has True and False values is of type boolean.
Schema:
root
|-- State: string (nullable = true)
|-- Account length: integer (nullable = true)
|-- Area code: integer (nullable = true)
|-- International plan: string (nullable = true)
|-- Voice mail plan: string (nullable = true)
.
.
.
|-- Customer service calls: integer (nullable = true)
|-- Churn: boolean (nullable = true)
So that's why I had to take the quotes off. Thank you.
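Since Python dictionaries treat the boolean key True and the string key 'True' as distinct keys (they hash differently), a mapping that lists both kinds of keys would let a single UDF cover the boolean Churn column as well as the 'Yes'/'No' string columns. A minimal sketch of just the lookup logic, without Spark:

```python
# Sketch (plain Python, no Spark): one mapping with both string and
# boolean keys, so the same lookup works for every column involved.
binary_map = {
    'Yes': 1.0, 'No': 0.0,   # string columns: International plan, Voice mail plan
    True: 1.0, False: 0.0,   # boolean column: Churn
}

to_num = lambda k: binary_map[k]  # the function the UDF would wrap

print(to_num('Yes'), to_num(False))  # 1.0 0.0
```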