How to change column types in Spark SQL's DataFrame?


Question


Suppose I'm doing something like:

val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
df.printSchema()

root
 |-- year: string (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- comment: string (nullable = true)
 |-- blank: string (nullable = true)

df.show()
year make  model comment              blank
2012 Tesla S     No comment                
1997 Ford  E350  Go get one now th...  

but I really wanted the year as Int (and perhaps transform some other columns).

The best I could come up with is

df.withColumn("year2", 'year.cast("Int")).select('year2 as 'year, 'make, 'model, 'comment, 'blank)
org.apache.spark.sql.DataFrame = [year: int, make: string, model: string, comment: string, blank: string]

which is a bit convoluted.

I'm coming from R, and I'm used to being able to write, e.g.

df2 <- df %>%
   mutate(year = year %>% as.integer, 
          make = make %>% toupper)

I'm likely missing something, since there should be a better way to do this in spark/scala...

Solution

[EDIT: March 2016: thanks for the votes! Though really, this is not the best answer, I think the solutions based on withColumn, withColumnRenamed and cast put forward by msemelman, Martin Senne and others are simpler and cleaner].
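A side note on the `cast`-based solutions mentioned in this edit: Spark's string-to-int cast produces `null` for unparseable values rather than throwing. A minimal pure-Scala sketch of that contract (the helper name is hypothetical; `Try(...).toOption` stands in for the cast's null-on-failure behavior, so no Spark session is needed):

```scala
import scala.util.Try

// parseIntLike mirrors what cast("Int") does to a string column value:
// a valid number converts, anything else becomes None (in Spark: null).
val parseIntLike: String => Option[Int] = s => Try(s.toInt).toOption
```

So after `'year.cast("Int")`, a row like "2012" becomes 2012, while a stray header row or blank becomes null instead of failing the whole job.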

I think your approach is ok, recall that a Spark DataFrame is an (immutable) RDD of Rows, so we're never really replacing a column, just creating new DataFrame each time with a new schema.
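The same immutability holds for ordinary Scala collections, which may make the point concrete: "changing" a type never touches the original, it builds a new value. A tiny plain-Scala analogy (no Spark needed):

```scala
// An immutable List, like a DataFrame, is never modified in place:
val original  = List("2012", "1997")
val converted = original.map(_.toInt) // a *new* list with a new element type

// `original` is untouched, just as df survives df.withColumn(...)
```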

Assuming you have an original df with the following schema:

scala> df.printSchema
root
 |-- Year: string (nullable = true)
 |-- Month: string (nullable = true)
 |-- DayofMonth: string (nullable = true)
 |-- DayOfWeek: string (nullable = true)
 |-- DepDelay: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- CRSDepTime: string (nullable = true)

And some UDFs defined on one or several columns:

import org.apache.spark.sql.functions._

val toInt    = udf[Int, String]( _.toInt)
val toDouble = udf[Double, String]( _.toDouble)
val toHour   = udf((t: String) => "%04d".format(t.toInt).take(2).toInt ) 
val days_since_nearest_holidays = udf( 
  (year:String, month:String, dayOfMonth:String) => year.toInt + 27 + month.toInt-12
 )
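The bodies of these udfs are ordinary Scala functions, so they can be sanity-checked without a Spark session. For example, toHour zero-pads a time like "730" to "0730" and keeps the hour digits; the expression below is lifted verbatim from the udf above:

```scala
// Same expression as inside toHour: pad to 4 digits, take the hour part.
val toHourLogic: String => Int = t => "%04d".format(t.toInt).take(2).toInt
```

So a departure time of "730" maps to hour 7, and "1455" to hour 14.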

Changing column types or even building a new DataFrame from another can be written like this:

val featureDf = df
.withColumn("departureDelay", toDouble(df("DepDelay")))
.withColumn("departureHour",  toHour(df("CRSDepTime")))
.withColumn("dayOfWeek",      toInt(df("DayOfWeek")))              
.withColumn("dayOfMonth",     toInt(df("DayofMonth")))              
.withColumn("month",          toInt(df("Month")))              
.withColumn("distance",       toDouble(df("Distance")))              
.withColumn("nearestHoliday", days_since_nearest_holidays(
              df("Year"), df("Month"), df("DayofMonth"))
            )              
.select("departureDelay", "departureHour", "dayOfWeek", "dayOfMonth", 
        "month", "distance", "nearestHoliday")            

which yields:

scala> featureDf.printSchema
root
 |-- departureDelay: double (nullable = true)
 |-- departureHour: integer (nullable = true)
 |-- dayOfWeek: integer (nullable = true)
 |-- dayOfMonth: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- distance: double (nullable = true)
 |-- nearestHoliday: integer (nullable = true)

This is pretty close to your own solution. Simply put, keeping the type changes and other transformations as separate udf vals makes the code more readable and reusable.

