How to convert all columns of a DataFrame to numeric in Spark Scala?
Question
I loaded a CSV file as a DataFrame. I would like to cast all columns to float, but the file has too many columns to write out every column name by hand:
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header",true).option("inferSchema", "true").csv("C:/Users/mhattabi/Desktop/dataTest2.csv")
Recommended answer
Given this DataFrame as an example:
val df = sqlContext.createDataFrame(Seq(("0", 0),("1", 1),("2", 0))).toDF("id", "c0")
with the schema:
StructType(
StructField(id,StringType,true),
StructField(c0,IntegerType,false))
You can loop over the DataFrame's columns using .columns and cast each one in turn:
import org.apache.spark.sql.functions.col
val castedDF = df.columns.foldLeft(df)((current, c) => current.withColumn(c, col(c).cast("float")))
The new DataFrame's schema then looks like:
StructType(
StructField(id,FloatType,true),
StructField(c0,FloatType,false))
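As a possible alternative to folding with withColumn, the same cast can be expressed as a single select over all columns. The sketch below is only an illustration: the object name CastAllColumns and the helper castAllToFloat are hypothetical, and it assumes column names without dots (a name such as a.b would need backtick quoting before being passed to col).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of the original answer.
object CastAllColumns {
  // Build one projection that casts every column to float,
  // keeping each column's original name via an explicit alias.
  def castAllToFloat(df: DataFrame): DataFrame =
    df.select(df.columns.map(c => col(c).cast("float").as(c)): _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[1]")
      .appName("cast-all-columns")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the answer above.
    val df = Seq(("0", 0), ("1", 1), ("2", 0)).toDF("id", "c0")
    castAllToFloat(df).printSchema()

    spark.stop()
  }
}
```

With a very wide DataFrame, a single select keeps the logical plan to one projection, whereas each withColumn call adds another projection for the optimizer to collapse; whether that matters in practice depends on the number of columns.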
If you want to exclude some columns from the cast, you could do something like the following (supposing we want to exclude the id column):
val exclude = Array("id")
val someCastedDF = (df.columns.toBuffer --= exclude).foldLeft(df)((current, c) =>
current.withColumn(c, col(c).cast("float")))
where exclude is an Array of all the columns we want to exclude from casting.
The schema of this new DataFrame is:
StructType(
StructField(id,StringType,true),
StructField(c0,FloatType,false))
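The exclusion can also be written as a single select that leaves the excluded columns untouched and preserves the original column order. This is a minimal sketch under the same assumptions as the answer: the object name CastExcept and the helper castExcept are hypothetical, and column names are assumed not to contain dots.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of the original answer.
object CastExcept {
  // Cast every column to float except those named in `exclude`,
  // which pass through with their original type.
  def castExcept(df: DataFrame, exclude: Set[String]): DataFrame =
    df.select(df.columns.map { c =>
      if (exclude.contains(c)) col(c) else col(c).cast("float").as(c)
    }: _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[1]")
      .appName("cast-except")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("0", 0), ("1", 1), ("2", 0)).toDF("id", "c0")
    // Keep id as a string and cast only c0.
    castExcept(df, Set("id")).printSchema()

    spark.stop()
  }
}
```

Using a Set for the excluded names makes the membership check cheap when both the column list and the exclusion list are long.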
Please note that this may not be the best way to do it, but it can serve as a starting point.