spark scala avro write fails with AbstractMethodError


Problem description

I'm trying to read data from Avro, repartition it by a field, and save it back in Avro format. Below is my sample code. During debugging, I cannot even do a show(10) on my DataFrame; it fails with the error below. Can someone please help me understand what I'm doing wrong in my code?

Code:

import org.apache.spark.sql.avro._

// Read the source Avro file from S3
val df = spark.read.format("avro").load("s3://test-bucket/source.avro")

// Both of these actions fail with the AbstractMethodError below
df.show(10)
df.write.partitionBy("partitioning_column").format("avro").save("s3://test-bucket/processed/processed.avro")

Both show and write fail with the following error:

java.lang.AbstractMethodError: org.apache.spark.sql.avro.AvroFileFormat.shouldPrefetchData(Lorg/apache/spark/sql/SparkSession;Lorg/apache/spark/sql/types/StructType;Lorg/apache/spark/sql/types/StructType;)Z
  at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:309)
  at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:305)
  at org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:404)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:70)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:283)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:375)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
  at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2764)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:751)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:710)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:719)
  ... 85 elided

Recommended answer

This is caused by an unintended binary-incompatible change to FileFormat in emr-5.28.0, which will be fixed when emr-5.29.0 comes out. Fortunately, for the Avro format there is an easy workaround on emr-5.28.0: instead of using the version of spark-avro from Maven Central, use the spark-avro jar bundled with EMR. That is, instead of something like --packages org.apache.spark:spark-avro_2.11:2.4.4, use --jars /usr/lib/spark/external/lib/spark-avro.jar.
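
For reference, a minimal sketch of the two launch commands side by side, assuming the job is started with spark-shell on the EMR cluster (spark-submit takes the same flags; the jar path is the one quoted above):

# Fails on emr-5.28.0: pulls spark-avro from Maven Central
spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.4

# Workaround: use the spark-avro jar bundled with EMR
spark-shell --jars /usr/lib/spark/external/lib/spark-avro.jar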
