Better way to convert a string field into timestamp in Spark
Question
I have a CSV file in which one field is of timestamp type. I cannot import it directly as a timestamp because it is not in the expected format, so I import it as a string and convert it like this:
import java.sql.Timestamp
import java.text.SimpleDateFormat

def getTimestamp(x: Any): java.sql.Timestamp = {
  val format = new SimpleDateFormat("MM/dd/yyyy HH:mm:ss")
  if (x.toString == "") {
    null
  } else {
    val d = format.parse(x.toString)
    new Timestamp(d.getTime)
  }
}
import org.apache.spark.sql.Row

def convert(row: Row): Row =
  Row(row(0), row(1), row(2), getTimestamp(row(3)))
Is there a better, more concise way to do this with the DataFrame API or Spark SQL? The method above requires creating an RDD and specifying the schema for the DataFrame again.
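For context, the RDD round-trip described above looks roughly like this. This is only a sketch: the names `df`, `sqlContext`, and `schema` are assumptions, not part of the original code.

```scala
// Hypothetical sketch of the RDD round-trip the question complains about,
// assuming an existing DataFrame `df` and a StructType `schema` whose
// fourth field is TimestampType.
val convertedRdd = df.rdd.map(convert)                            // apply the row-level conversion
val converted   = sqlContext.createDataFrame(convertedRdd, schema) // re-attach the schema by hand
```

The extra `createDataFrame` call with an explicit schema is exactly the boilerplate the question is asking to avoid.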
Answer
I have ISO 8601 timestamps in my dataset and I needed to convert them to the "yyyy-MM-dd" format. This is what I did:
import org.joda.time.{DateTime, DateTimeZone}

object DateUtils extends Serializable {
  def dtFromUtcSeconds(seconds: Int): DateTime =
    new DateTime(seconds * 1000L, DateTimeZone.UTC)

  def dtFromIso8601(isoString: String): DateTime =
    new DateTime(isoString, DateTimeZone.UTC)
}

sqlContext.udf.register("formatTimeStamp", (isoTimestamp: String) =>
  DateUtils.dtFromIso8601(isoTimestamp).toString("yyyy-MM-dd"))
Then you can just use the UDF in your Spark SQL query.
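A minimal sketch of calling the registered UDF from SQL. The DataFrame `df`, its string column `created_at`, and the temp table name `events` are hypothetical names for illustration:

```scala
// Hypothetical usage: `df` has a string column `created_at`
// holding ISO 8601 timestamps.
df.registerTempTable("events")
val dates = sqlContext.sql(
  "SELECT formatTimeStamp(created_at) AS event_date FROM events")
dates.show()
```

Because the function is registered on `sqlContext.udf`, it is only visible in SQL queries run through that same `sqlContext`.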