How can I change column types in Spark SQL's DataFrame?


Problem description

Suppose I'm doing something like:

val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
df.printSchema()

root
 |-- year: string (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- comment: string (nullable = true)
 |-- blank: string (nullable = true)

df.show()
year make  model comment              blank
2012 Tesla S     No comment
1997 Ford  E350  Go get one now th...

But I really wanted the year as Int (and perhaps transform some other columns).

The best I could come up with is:

df.withColumn("year2", 'year.cast("Int")).select('year2 as 'year, 'make, 'model, 'comment, 'blank)
org.apache.spark.sql.DataFrame = [year: int, make: string, model: string, comment: string, blank: string]

That's a bit convoluted.

I'm coming from R, and I'm used to being able to write, e.g.

df2 <- df %>%
   mutate(year = year %>% as.integer,
          make = make %>% toupper)

I'm likely missing something, since there should be a better way to do this in Spark/Scala...

Recommended answer

Newest version

Since Spark 2.x you can use .withColumn. Check the docs here:

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@withColumn(colName:String,col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame
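
For example, here is a minimal sketch of casting the column in place with Spark 2.x. It assumes a SparkSession named spark and the same cars.csv as above; it is not part of the original answer.

import org.apache.spark.sql.functions.col

// Spark 2.x sketch: read the CSV and overwrite the year column with its Int cast.
val df = spark.read.option("header", "true").csv("cars.csv")
val df2 = df.withColumn("year", col("year").cast("int"))
df2.printSchema()  // year: integer (nullable = true)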

Since Spark version 1.4 you can apply the cast method with a DataType on the column:

import org.apache.spark.sql.types.IntegerType
val df2 = df.withColumn("yearTmp", df("year").cast(IntegerType))
    .drop("year")
    .withColumnRenamed("yearTmp", "year")

If you are using SQL expressions you can also do:

val df2 = df.selectExpr("cast(year as int) year", 
                        "make", 
                        "model", 
                        "comment", 
                        "blank")

For more info check the docs: http://spark.apache.org/docs/1.6.0/api/scala/#org.apache.spark.sql.DataFrame
