Add new column in DataFrame based on existing column
Question
I have a csv file with a datetime column: "2011-05-02T04:52:09+00:00".
I am using Scala; the file is loaded into a Spark DataFrame, and I can use Joda-Time to parse the date:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "data.csv", "header" -> "true"))
val d = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ssZ")
I would like to create new columns based on the datetime field for time-series analysis.
In a DataFrame, how do I create a column based on the value of another column?
I notice DataFrame has the following function: df.withColumn("dt", column). Is there a way to create a column based on the value of an existing column?
Thanks
import org.apache.spark.sql.types.DateType
import org.apache.spark.sql.functions._
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat
val d = DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ssZ")
import java.sql.Date
val dtFunc: (String => Date) = (arg1: String) => new Date(DateTime.parse(arg1, d).getMillis)
val x = df.withColumn("dt", callUDF(dtFunc, DateType, col("dt_string")))
The callUDF and col functions are included in functions, as the imports show. The dt_string inside col("dt_string") is the name of the source column in your df, the one you want to transform from.
Alternatively, you could replace the last statement with:
val dtFunc2 = udf(dtFunc)
val x = df.withColumn("dt", dtFunc2(col("dt_string")))
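The parsing step inside dtFunc can also be checked on its own, outside Spark. Here is a minimal sketch using the JDK's java.time instead of Joda-Time (the name dtFunc is reused purely for illustration; java.sql.Date is the JVM type that Spark's DateType expects):

```scala
import java.sql.Date
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter

// Parse an ISO-8601 timestamp such as "2011-05-02T04:52:09+00:00"
// and keep only the date part, as java.sql.Date (Spark's DateType).
val dtFunc: String => Date = s =>
  Date.valueOf(OffsetDateTime.parse(s, DateTimeFormatter.ISO_OFFSET_DATE_TIME).toLocalDate)

println(dtFunc("2011-05-02T04:52:09+00:00")) // 2011-05-02
```

Wrapped with udf(dtFunc), this drops into df.withColumn exactly like the Joda-Time version; on newer Spark releases, built-ins such as to_date may let you avoid the UDF entirely.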