Better way to convert a string field into timestamp in Spark


Problem description

I have a CSV in which one field is of timestamp type. I cannot import it directly as a timestamp because the value is not in the expected format, so I import it as a string and convert it like this:

import java.sql.Timestamp
import java.text.SimpleDateFormat
import org.apache.spark.sql.Row

// Parse a "MM/dd/yyyy HH:mm:ss" string into a java.sql.Timestamp; empty strings become null.
def getTimestamp(x: Any): java.sql.Timestamp = {
  val format = new SimpleDateFormat("MM/dd/yyyy HH:mm:ss")
  if (x.toString == "") {
    null
  } else {
    val d = format.parse(x.toString)
    new Timestamp(d.getTime)
  }
}

// Rebuild the Row, replacing the fourth column with its parsed timestamp.
def convert(row: Row): Row = {
  Row(row(0), row(1), row(2), getTimestamp(row(3)))
}

Is there a better, more concise way to do this with the DataFrame API or Spark SQL? The method above requires creating an RDD and specifying the schema for the DataFrame again.
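
For reference, here is a minimal sketch of how those helpers end up being wired together, which illustrates the extra boilerplate being complained about. The input RDD rows, the sqlContext, and the column names are assumptions for illustration, not from the original post:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Assumed target schema: three string columns plus the converted timestamp column.
val schema = StructType(Seq(
  StructField("col1", StringType, nullable = true),
  StructField("col2", StringType, nullable = true),
  StructField("col3", StringType, nullable = true),
  StructField("event_time", TimestampType, nullable = true)
))

// Map every Row through convert, then rebuild a DataFrame with the new schema by hand.
val convertedRdd = rows.map(convert)
val df = sqlContext.createDataFrame(convertedRdd, schema)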

Recommended answer

I have ISO 8601 timestamps in my dataset and I needed to convert them to the "yyyy-MM-dd" format. This is what I did:

import org.joda.time.{DateTime, DateTimeZone}

// Helpers for building UTC DateTimes; Serializable so they can be shipped to executors.
object DateUtils extends Serializable {
  def dtFromUtcSeconds(seconds: Int): DateTime = new DateTime(seconds * 1000L, DateTimeZone.UTC)
  def dtFromIso8601(isoString: String): DateTime = new DateTime(isoString, DateTimeZone.UTC)
}

// Register a UDF that reformats an ISO 8601 timestamp string as "yyyy-MM-dd".
sqlContext.udf.register("formatTimeStamp", (isoTimestamp: String) =>
  DateUtils.dtFromIso8601(isoTimestamp).toString("yyyy-MM-dd"))

You can then use the UDF directly in your Spark SQL query.
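
For illustration, a small sketch of calling the registered UDF from a Spark SQL query; the DataFrame df, the temporary table name events, and the column name iso_ts are assumed here, not taken from the answer:

// Assumes a DataFrame df with an ISO 8601 string column "iso_ts".
df.registerTempTable("events")

// Call the registered UDF inside a SQL statement.
val formatted = sqlContext.sql(
  "SELECT formatTimeStamp(iso_ts) AS event_date FROM events")
formatted.show()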
