How to change column types in Spark SQL's DataFrame?
Problem description
Suppose I'm doing something like:
val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
df.printSchema()
root
|-- year: string (nullable = true)
|-- make: string (nullable = true)
|-- model: string (nullable = true)
|-- comment: string (nullable = true)
|-- blank: string (nullable = true)
df.show()
year  make   model  comment             blank
2012  Tesla  S      No comment
1997  Ford   E350   Go get one now th...
but I really wanted the year as Int (and perhaps transform some other columns).
The best I could come up with is
df.withColumn("year2", 'year.cast("Int")).select('year2 as 'year, 'make, 'model, 'comment, 'blank)
org.apache.spark.sql.DataFrame = [year: int, make: string, model: string, comment: string, blank: string]
which is a bit convoluted.
I'm coming from R, and I'm used to being able to write, e.g.
df2 <- df %>%
mutate(year = year %>% as.integer,
make = make %>% toupper)
I'm likely missing something, since there should be a better way to do this in spark/scala...
[EDIT: March 2016: thanks for the votes! Though really, this is not the best answer; I think the solutions based on withColumn, withColumnRenamed and cast put forward by msemelman, Martin Senne and others are simpler and cleaner.]
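For reference, the cast-based approach that edit alludes to can be sketched as follows. This is a sketch only: it assumes a live SparkSession/SQLContext and the df from the question, with a string "year" column; withColumn reusing an existing column name replaces that column in place.

```scala
import org.apache.spark.sql.functions.col

// Replace the string "year" column with an int column of the same name.
// Rows whose "year" value cannot be parsed as an int become null.
val df2 = df.withColumn("year", col("year").cast("int"))
```

This avoids the temporary "year2" column and the follow-up select entirely.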
I think your approach is OK; recall that a Spark DataFrame is an (immutable) RDD of Rows, so we never really replace a column, we just create a new DataFrame each time with a new schema.
Assuming you have an original df with the following schema:
scala> df.printSchema
root
|-- Year: string (nullable = true)
|-- Month: string (nullable = true)
|-- DayofMonth: string (nullable = true)
|-- DayOfWeek: string (nullable = true)
|-- DepDelay: string (nullable = true)
|-- Distance: string (nullable = true)
|-- CRSDepTime: string (nullable = true)
And some UDFs defined on one or several columns:
import org.apache.spark.sql.functions._
val toInt = udf[Int, String]( _.toInt)
val toDouble = udf[Double, String]( _.toDouble)
val toHour = udf((t: String) => "%04d".format(t.toInt).take(2).toInt )
val days_since_nearest_holidays = udf(
(year:String, month:String, dayOfMonth:String) => year.toInt + 27 + month.toInt-12
)
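The plain-Scala logic wrapped by these UDFs can be checked without a Spark session. For instance, the function inside toHour zero-pads an "HMM"/"HHMM" time string to four digits and keeps the leading two as the hour (hourOf here is just a hypothetical standalone replica for illustration):

```scala
// Standalone replica of the function passed to the toHour udf above:
// parse the string, zero-pad to "HHMM", keep the first two digits as the hour.
val hourOf: String => Int = (t: String) => "%04d".format(t.toInt).take(2).toInt

val h1 = hourOf("730")   // "0730" -> 7
val h2 = hourOf("1455")  // "1455" -> 14
```

Testing the conversion logic this way, before wrapping it in udf, makes failures much easier to diagnose than a task error inside a Spark job.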
Changing column types or even building a new DataFrame from another can be written like this:
val featureDf = df
.withColumn("departureDelay", toDouble(df("DepDelay")))
.withColumn("departureHour", toHour(df("CRSDepTime")))
.withColumn("dayOfWeek", toInt(df("DayOfWeek")))
.withColumn("dayOfMonth", toInt(df("DayofMonth")))
.withColumn("month", toInt(df("Month")))
.withColumn("distance", toDouble(df("Distance")))
.withColumn("nearestHoliday", days_since_nearest_holidays(
df("Year"), df("Month"), df("DayofMonth"))
)
.select("departureDelay", "departureHour", "dayOfWeek", "dayOfMonth",
"month", "distance", "nearestHoliday")
which yields:
scala> featureDf.printSchema
root
|-- departureDelay: double (nullable = true)
|-- departureHour: integer (nullable = true)
|-- dayOfWeek: integer (nullable = true)
|-- dayOfMonth: integer (nullable = true)
|-- month: integer (nullable = true)
|-- distance: double (nullable = true)
|-- nearestHoliday: integer (nullable = true)
This is pretty close to your own solution. Simply put, keeping the type changes and other transformations as separate udf vals makes the code more readable and reusable.