How to convert a column with string type to int in a PySpark DataFrame?
Problem description
I have a DataFrame in PySpark. Some of its numeric columns contain 'nan', so when I read the data and check the schema, those columns have type 'string'. How can I change them to int type? I replaced the 'nan' values with 0 and checked the schema again, but it still shows string type for those columns. I am using the code below:
data_df = sqlContext.read.format("csv").load('data.csv',header=True, inferSchema="true")
data_df.printSchema()
data_df = data_df.fillna(0)
data_df.printSchema()
My data looks like this:
Here the columns 'Plays' and 'drafts' contain integer values, but because of the nan values present in them, they are treated as string type.
Recommended answer
from pyspark.sql.types import IntegerType
data_df = data_df.withColumn("Plays", data_df["Plays"].cast(IntegerType()))
data_df = data_df.withColumn("drafts", data_df["drafts"].cast(IntegerType()))
You can also loop over the columns if there are many of them, but casting each one as shown above is the simplest way to convert a string column to an integer.
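The loop mentioned above could look roughly like the sketch below. The helper name cast_columns_to_int is hypothetical (it is not part of PySpark); it only relies on DataFrame.withColumn and Column.cast, which also accepts a type name such as "int" in place of IntegerType(). Note that the cast does not turn 'nan' into 0: depending on Spark's ANSI setting, a string that cannot be parsed as an integer either becomes null or raises an error, so this is a starting point rather than a complete cleanup.

```python
def cast_columns_to_int(df, int_columns):
    """Cast each named column of a PySpark DataFrame to int.

    Hypothetical helper for illustration; `df` is assumed to be a
    pyspark.sql.DataFrame and `int_columns` a list of column names.
    """
    for name in int_columns:
        # Replace the column with an integer-typed version of itself.
        df = df.withColumn(name, df[name].cast("int"))
    return df


# Usage with the DataFrame from the question (requires a running Spark session):
# data_df = cast_columns_to_int(data_df, ["Plays", "drafts"])
# data_df.printSchema()
```

If the unparseable values end up as null after the cast, calling data_df.fillna(0) afterwards should then fill them, since the columns are numeric at that point.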