Add new column in DataFrame based on existing column
Question
I have a csv file with a datetime column: "2011-05-02T04:52:09+00:00".
I am using Scala; the file is loaded into a Spark DataFrame, and I can use Joda-Time to parse the date:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "data.csv", "header" -> "true"))
val d = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ssZ")
I would like to create new columns based on the datetime field for time-series analysis.
In a DataFrame, how do I create a column based on the value of another column?
I notice DataFrame has the following function: df.withColumn("dt", column). Is there a way to create a column based on the value of an existing column?
Thanks
import org.apache.spark.sql.types.DateType
import org.apache.spark.sql.functions._
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat
val d = DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ssZ")
import java.sql.Date
val dtFunc: (String => Date) = (arg1: String) => new Date(DateTime.parse(arg1, d).getMillis)
val x = df.withColumn("dt", callUDF(dtFunc, DateType, col("dt_string")))
The callUDF and col functions are included in functions, as the imports show. The dt_string inside col("dt_string") is the name of the source column in your df, the one you want to transform from.
Alternatively, you could replace the last statement with:
val dtFunc2 = udf(dtFunc)
val x = df.withColumn("dt", dtFunc2(col("dt_string")))
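The parsing step inside dtFunc can also be checked on its own, outside Spark. Here is a minimal sketch using the JDK's java.time instead of Joda-Time (the name dtFunc is reused purely for illustration; java.sql.Date is the JVM type that Spark's DateType expects):

```scala
import java.sql.Date
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter

// Parse an ISO-8601 timestamp such as "2011-05-02T04:52:09+00:00"
// and keep only the date part, as java.sql.Date (Spark's DateType).
val dtFunc: String => Date = s =>
  Date.valueOf(OffsetDateTime.parse(s, DateTimeFormatter.ISO_OFFSET_DATE_TIME).toLocalDate)

println(dtFunc("2011-05-02T04:52:09+00:00")) // 2011-05-02
```

Wrapped with udf(dtFunc), this drops into df.withColumn exactly like the Joda-Time version; on newer Spark releases, built-ins such as to_date may let you avoid the UDF entirely.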