Less-than comparison for dates in a Spark Scala RDD


Problem description

I want to print the data of employees who joined before 1991. Below is my sample data:

69062,FRANK,ANALYST,5646,1991-12-03,3100.00,,2001
63679,SANDRINE,CLERK,69062,1990-12-18,900.00,,2001 

Initial RDD for loading the data:

val rdd=sc.textFile("file:////home/hduser/Desktop/Employees/employees.txt").filter(p=>{p!=null && p.trim.length>0})

Helper function for converting the string column to a date:

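import java.text.SimpleDateFormat
import java.util.Date

// Parses a yyyy-MM-dd string into a java.util.Date.
def convertStringToDate(s: String): Date = {
  val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
  dateFormat.parse(s)
}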

Mapping each column to its data type:

val dateRdd = rdd.map(_.split(",")).map(p => (
  if (p(0).length > 0) p(0).toLong else 0L,
  p(1),
  p(2),
  if (p(3).length > 0) p(3).toLong else 0L,
  convertStringToDate(p(4)),
  if (p(5).length > 0) p(5).toDouble else 0D,
  if (p(6).length > 0) p(6).toDouble else 0D,
  if (p(7).length > 0) p(7).toInt else 0
))

Now I get the data as tuples, as shown below:

(69062,FRANK,ANALYST,5646,Tue Dec 03 00:00:00 IST 1991,3100.0,0.0,2001)
(63679,SANDRINE,CLERK,69062,Tue Dec 18 00:00:00 IST 1990,900.0,0.0,2001)

When I execute the following command, I get this error:

scala> dateRdd.map(p=>(!(p._5.before("1991")))).foreach(println)
<console>:36: error: type mismatch;
 found   : String("1991")
 required: java.util.Date
              dateRdd.map(p=>(!(p._5.before("1991")))).foreach(println)
                                            ^

So where am I going wrong?

Recommended answer

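Since you are working with RDDs rather than DataFrames, and your dates are plain yyyy-MM-dd strings, a simple approach is to keep the date field as a String and compare it directly. Strings in this format sort lexicographically in the same order as the dates they represent, so a plain < comparison is enough: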

val rdd = sc.parallelize(Seq((69062,"FRANK","ANALYST",5646, "1991-12-03",3100.00,2001),(63679,"SANDRINE","CLERK",69062,"1990-12-18",900.00,2001)))
rdd.filter(p=>(p._5 < "1991-01-01")).foreach(println)
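
If you would rather keep the java.util.Date column from the question, the compile error goes away once the argument to before is itself a Date: the compiler rejects "1991" because Date.before takes a java.util.Date, not a String. Note also that map with a negated condition would only produce Booleans; selecting rows needs filter. The following is a minimal sketch of that idea, not a definitive implementation. The names JoinedBefore1991, cutoff and joinedBefore1991 are hypothetical, and a plain Scala Seq stands in for dateRdd so the block runs without a SparkContext.

```scala
import java.text.SimpleDateFormat
import java.util.Date

object JoinedBefore1991 {
  // Same parsing logic as convertStringToDate in the question.
  def convertStringToDate(s: String): Date =
    new SimpleDateFormat("yyyy-MM-dd").parse(s)

  // Hypothetical cutoff: Date.before needs a Date, not the String "1991".
  val cutoff: Date = convertStringToDate("1991-01-01")

  // True when the join date is strictly earlier than 1 January 1991.
  def joinedBefore1991(joined: Date): Boolean = joined.before(cutoff)

  def main(args: Array[String]): Unit = {
    // Assumption: a local Seq replaces dateRdd, with only three of the
    // eight fields kept for brevity.
    val employees = Seq(
      (69062L, "FRANK", convertStringToDate("1991-12-03")),
      (63679L, "SANDRINE", convertStringToDate("1990-12-18"))
    )
    // Keep only the rows whose join date is before the cutoff.
    employees.filter(e => joinedBefore1991(e._3)).foreach(println)
  }
}
```

On the RDD from the question, the same predicate would be applied as dateRdd.filter(p => p._5.before(cutoff)).foreach(println), assuming cutoff is defined in the shell session as above.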
