How to convert column with string type to int form in pyspark data frame?


Question

I have a dataframe in pyspark. Some of its numerical columns contain 'nan', so when I read the data and check the schema of the dataframe, those columns have 'string' type. How can I change them to int type? I replaced the 'nan' values with 0 and checked the schema again, but it still shows string type for those columns. I am using the code below:

data_df = sqlContext.read.format("csv").load('data.csv',header=True, inferSchema="true")
data_df.printSchema()
data_df = data_df.fillna(0)
data_df.printSchema()

My data looks like this:

Here the columns 'Plays' and 'drafts' contain integer values, but because of the 'nan' entries present in these columns, they are treated as string type.

Answer

from pyspark.sql.types import IntegerType
data_df = data_df.withColumn("Plays", data_df["Plays"].cast(IntegerType()))
data_df = data_df.withColumn("drafts", data_df["drafts"].cast(IntegerType()))

You could run a loop over the columns, but this is the simplest way to convert a string column into an integer.
