How to change column types in Spark SQL's DataFrame?
Problem description
Suppose I'm doing something like:
val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
df.printSchema()
root
|-- year: string (nullable = true)
|-- make: string (nullable = true)
|-- model: string (nullable = true)
|-- comment: string (nullable = true)
|-- blank: string (nullable = true)
df.show()
year  make   model  comment             blank
2012  Tesla  S      No comment
1997  Ford   E350   Go get one now th...
but I really wanted the year as Int (and perhaps transform some other columns).
The best I could come up with is
df.withColumn("year2", 'year.cast("Int")).select('year2 as 'year, 'make, 'model, 'comment, 'blank)
org.apache.spark.sql.DataFrame = [year: int, make: string, model: string, comment: string, blank: string]
which is a bit convoluted.
I'm coming from R, and I'm used to being able to write, e.g.
df2 <- df %>%
mutate(year = year %>% as.integer,
make = make %>% toupper)
I'm likely missing something, since there should be a better way to do this in spark/scala...
[EDIT: March 2016: thanks for the votes! Though really, this is not the best answer; I think the solutions based on withColumn, withColumnRenamed and cast put forward by msemelman, Martin Senne and others are simpler and cleaner.]
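For reference, the cast-based approach that edit alludes to can be sketched as follows. This is a sketch only: it assumes a live SparkSession/SQLContext and the df from the question, with a string "year" column; withColumn reusing an existing column name replaces that column in place.

```scala
import org.apache.spark.sql.functions.col

// Replace the string "year" column with an int column of the same name.
// Rows whose "year" value cannot be parsed as an int become null.
val df2 = df.withColumn("year", col("year").cast("int"))
```

This avoids the temporary "year2" column and the follow-up select entirely.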
I think your approach is OK; recall that a Spark DataFrame is an (immutable) RDD of Rows, so we never really replace a column, we just create a new DataFrame each time with a new schema.
Assuming you have an original df with the following schema:
scala> df.printSchema
root
|-- Year: string (nullable = true)
|-- Month: string (nullable = true)
|-- DayofMonth: string (nullable = true)
|-- DayOfWeek: string (nullable = true)
|-- DepDelay: string (nullable = true)
|-- Distance: string (nullable = true)
|-- CRSDepTime: string (nullable = true)
And some UDFs defined on one or several columns:
import org.apache.spark.sql.functions._
val toInt = udf[Int, String]( _.toInt)
val toDouble = udf[Double, String]( _.toDouble)
val toHour = udf((t: String) => "%04d".format(t.toInt).take(2).toInt )
val days_since_nearest_holidays = udf(
(year:String, month:String, dayOfMonth:String) => year.toInt + 27 + month.toInt-12
)
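The plain-Scala logic wrapped by these UDFs can be checked without a Spark session. For instance, the function inside toHour zero-pads an "HMM"/"HHMM" time string to four digits and keeps the leading two as the hour (hourOf here is just a hypothetical standalone replica for illustration):

```scala
// Standalone replica of the function passed to the toHour udf above:
// parse the string, zero-pad to "HHMM", keep the first two digits as the hour.
val hourOf: String => Int = (t: String) => "%04d".format(t.toInt).take(2).toInt

val h1 = hourOf("730")   // "0730" -> 7
val h2 = hourOf("1455")  // "1455" -> 14
```

Testing the conversion logic this way, before wrapping it in udf, makes failures much easier to diagnose than a task error inside a Spark job.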
Changing column types or even building a new DataFrame from another can be written like this:
val featureDf = df
.withColumn("departureDelay", toDouble(df("DepDelay")))
.withColumn("departureHour", toHour(df("CRSDepTime")))
.withColumn("dayOfWeek", toInt(df("DayOfWeek")))
.withColumn("dayOfMonth", toInt(df("DayofMonth")))
.withColumn("month", toInt(df("Month")))
.withColumn("distance", toDouble(df("Distance")))
.withColumn("nearestHoliday", days_since_nearest_holidays(
df("Year"), df("Month"), df("DayofMonth"))
)
.select("departureDelay", "departureHour", "dayOfWeek", "dayOfMonth",
"month", "distance", "nearestHoliday")
which yields:
scala> featureDf.printSchema
root
|-- departureDelay: double (nullable = true)
|-- departureHour: integer (nullable = true)
|-- dayOfWeek: integer (nullable = true)
|-- dayOfMonth: integer (nullable = true)
|-- month: integer (nullable = true)
|-- distance: double (nullable = true)
|-- nearestHoliday: integer (nullable = true)
This is pretty close to your own solution. Simply put, keeping the type changes and other transformations as separate udf vals makes the code more readable and reusable.