How to convert all columns of a DataFrame to numeric in Spark Scala?

Problem description

I loaded a CSV file as a DataFrame. I would like to cast all columns to float, given that the file is too big to write out all the column names:

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("C:/Users/mhattabi/Desktop/dataTest2.csv")

Recommended answer

Given this DataFrame as an example:

val df = sqlContext.createDataFrame(Seq(("0", 0),("1", 1),("2", 0))).toDF("id", "c0")

with the schema:

StructType(
    StructField(id,StringType,true), 
    StructField(c0,IntegerType,false))

You can loop over the DataFrame's columns using .columns, folding a cast over each column name:

import org.apache.spark.sql.functions.col
val castedDF = df.columns.foldLeft(df)((current, c) => current.withColumn(c, col(c).cast("float")))

So the new DataFrame's schema looks like:

StructType(
    StructField(id,FloatType,true), 
    StructField(c0,FloatType,false))
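
To see why the fold leaves every column cast, it can help to look at foldLeft without Spark. The sketch below is only an illustration: it models a schema as a plain Map from column name to type name, and `Schema`, `castColumn`, and `castAll` are hypothetical names, not Spark APIs. Each step takes the result of the previous step as `current`, just as each withColumn call above receives the DataFrame produced by the previous one.

```scala
object FoldLeftSketch {
  // Stand-in for a DataFrame schema (assumption for illustration only):
  // column name -> type name.
  type Schema = Map[String, String]

  // Plays the role of current.withColumn(c, col(c).cast("float")):
  // returns a new schema in which one column is marked as float.
  def castColumn(current: Schema, column: String): Schema =
    current.updated(column, "float")

  // Starts from the initial schema and threads it through every column name.
  def castAll(schema: Schema, columns: Seq[String]): Schema =
    columns.foldLeft(schema)((current, c) => castColumn(current, c))

  def main(args: Array[String]): Unit = {
    val schema: Schema = Map("id" -> "string", "c0" -> "int")
    println(castAll(schema, Seq("id", "c0")))
  }
}
```

Passing a shorter list of column names to `castAll` leaves the remaining entries untouched.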

If you want to exclude some columns from casting, you could do something like the following (supposing we want to exclude the column id):

val exclude = Array("id")

val someCastedDF = (df.columns.toBuffer --= exclude).foldLeft(df)((current, c) =>
  current.withColumn(c, col(c).cast("float")))

where exclude is an Array of all the columns we want to exclude from casting.

So the schema of this new DataFrame is:

StructType(
    StructField(id,StringType,true), 
    StructField(c0,FloatType,false))
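
When there are too many columns to list by hand, the exclude list itself could be derived from the schema. The following is a minimal sketch, assuming Spark SQL is on the classpath and that the columns to leave alone are exactly those typed as StringType; `ExcludeByType` and `stringColumns` are hypothetical names introduced here, and the choice of StringType is an assumption rather than something the answer prescribes.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

object ExcludeByType {
  // Hypothetical helper: names of all columns whose declared type is StringType.
  def stringColumns(df: DataFrame): Array[String] =
    df.schema.fields.filter(_.dataType == StringType).map(_.name)

  // Same fold as in the answer, applied only to columns not in `exclude`.
  def castExcept(df: DataFrame, exclude: Array[String]): DataFrame =
    df.columns.filterNot(c => exclude.contains(c)).foldLeft(df)((current, c) =>
      current.withColumn(c, col(c).cast("float")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[1]").appName("exclude-sketch").getOrCreate()
    import spark.implicits._
    val df = Seq(("0", 0), ("1", 1), ("2", 0)).toDF("id", "c0")

    val exclude = stringColumns(df) // only "id" is a string column here
    castExcept(df, exclude).printSchema()
    spark.stop()
  }
}
```

Any other predicate on `dataType` could be substituted in `stringColumns` if a different set of columns should be skipped.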

Please note that this may not be the best solution, but it can serve as a starting point.
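
One possible alternative, not part of the original answer, is to build all the cast expressions up front and apply them in a single select rather than chaining one withColumn call per column. Each withColumn adds a projection to the plan, so a single select may be preferable for very wide DataFrames. The sketch below assumes Spark SQL is on the classpath; `CastAllColumns` and `castToFloat` are hypothetical names introduced for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object CastAllColumns {
  // Hypothetical helper: casts every column not listed in `exclude` to float
  // in one select; excluded columns are passed through unchanged.
  def castToFloat(df: DataFrame, exclude: Set[String] = Set.empty): DataFrame = {
    val projected = df.columns.map { c =>
      if (exclude.contains(c)) col(c)
      else col(c).cast("float").as(c) // alias keeps the original column name
    }
    df.select(projected: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[1]").appName("cast-sketch").getOrCreate()
    import spark.implicits._
    val df = Seq(("0", 0), ("1", 1), ("2", 0)).toDF("id", "c0")

    castToFloat(df).printSchema()            // both columns cast
    castToFloat(df, Set("id")).printSchema() // id left as string
    spark.stop()
  }
}
```

As with the fold-based version, string values that cannot be parsed as numbers become null after the cast, so it may be worth checking the result for unexpected nulls.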
