Regarding org.apache.spark.sql.AnalysisException error when creating a jar file using Scala


Problem Description

I have the following simple Scala class, which I will later modify to fit some machine learning models.

I need to create a jar file out of this, as I am going to run these models on amazon-emr. I am a beginner in this process, so I first tested whether I could successfully import the following csv file and write it to another file by creating a jar file from the Scala class mentioned below.

The csv file looks like this, and it includes a Date column as one of the variables.

+-------------------+-------------+-------+---------+-----+
|               Date|           x1|      y|       x2|   x3|
+-------------------+-------------+-------+---------+-----+
|0010-01-01 00:00:00|0.099636562E8|6405.29|    57.06|21.55|
|0010-03-31 00:00:00|0.016645123E8|5885.41|    53.54|21.89|
|0010-03-30 00:00:00|0.044308936E8|6260.95|57.080002|20.93|
|0010-03-27 00:00:00|0.124928214E8|6698.46|65.540001|23.44|
|0010-03-26 00:00:00|0.570222885E7|6768.49|     61.0|24.65|
|0010-03-25 00:00:00|0.086162414E8|6502.16|63.950001|25.24|
+-------------------+-------------+-------+---------+-----+

Dataset link: https://drive.google.com/open?id=18E6nf4_lK46kl_zwYJ1CIuBOT

I created a jar file out of this using IntelliJ IDEA, and it was successfully done.

import org.apache.spark.sql.SparkSession

object jar1 {
  def main(args: Array[String]): Unit = {

    // Create (or reuse) a SparkSession; on EMR the master is supplied by spark-submit
    val sc: SparkSession = SparkSession.builder()
      .appName("SparkByExample")
      .getOrCreate()

    // Read the csv with a header row, letting Spark infer the column types
    val data = sc.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(args(0))

    // Write the result back out as plain text
    data.write.format("text").save(args(1))
  }
}

After that, I uploaded this jar file along with the csv file mentioned above to amazon-s3 and tried to run it on an amazon-emr cluster.

But it failed, and I got the following error message:

ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.sql.AnalysisException: Text data source does not support timestamp data type.;

I am sure this error has something to do with the Date variable in the data set, but I don't know how to fix it.

Can anyone help me figure this out?

Updated:

I tried the same csv file mentioned earlier without the date column. In this case I get this error:

ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.sql.AnalysisException: Text data source does not support double data type.;

Thank you

Answer

I noticed later that you are writing to a text file. Spark's .format("text") does not support any type other than String, so to achieve your goal you first need to convert all the columns to String and then store the result:

// Convert each Row to a plain string (stripping the surrounding brackets) and save as text
df.rdd.map(_.toString().replace("[", "").replace("]", "")).saveAsTextFile("textfilename")
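
Alternatively, if you prefer to stay in the DataFrame API, here is a minimal sketch of the same idea; the column handling is an assumption based on your csv, and "textfile_output" is a hypothetical output path. It casts every column to String and concatenates them into the single string column that the text data source requires:

import org.apache.spark.sql.functions.{col, concat_ws}

// Cast each column to string and join them with commas into one column,
// because format("text") only accepts a single string column.
// "textfile_output" is a hypothetical output path.
val asText = df.select(
  concat_ws(",", df.columns.map(c => col(c).cast("string")): _*).as("value")
)
asText.write.format("text").save("textfile_output")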

If instead you can consider other options for storing the data as files, then you can keep the benefit of the types, for example by using CSV or JSON. Here is a working code example for csv, based on your csv file.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("Simple Application")
  .config("spark.master", "local")
  .getOrCreate()
import spark.implicits._

// Read the csv, letting Spark infer the schema so Date becomes a timestamp
val df = spark.read
  .format("csv")
  .option("delimiter", ",")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy-MM-dd")
  .load("datat.csv")

df.printSchema()
df.show()

// Write back as csv; the timestamp column keeps its type and is
// rendered with the explicit timestampFormat pattern
df.write
  .format("csv")
  .option("header", "true")
  .option("delimiter", "\t")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .option("escape", "\\")
  .save("another")

There is no need for a custom encoder/decoder.
