Error through remote Spark Job: java.lang.IllegalAccessError: class org.apache.hadoop.hdfs.web.HftpFileSystem
Question
I am trying to run a remote Spark job through IntelliJ with a Spark HDInsight cluster (HDI 4.0). In my Spark application, I am trying to read an input stream from a folder of parquet files in Azure blob storage using Spark Structured Streaming's built-in readStream function.
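For context, the read looks roughly like this. This is a sketch rather than my exact code: the schema fields, container, and storage-account names are placeholders, and wasbs:// is the scheme typically used for Azure blob storage on HDInsight.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("parquet-stream").getOrCreate()

// File-based streaming sources require the schema to be declared up front.
val schema = new StructType()
  .add("id", LongType)
  .add("value", StringType)

// Placeholder container/account names; the real path points at the folder
// of parquet files in blob storage.
val input = spark.readStream
  .schema(schema)
  .parquet("wasbs://mycontainer@myaccount.blob.core.windows.net/input")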
The code works as expected when I run it on a Zeppelin notebook attached to the HDInsight cluster. However, when I deploy my Spark application to the cluster, I encounter the following error:
java.lang.IllegalAccessError: class org.apache.hadoop.hdfs.web.HftpFileSystem cannot access its superinterface org.apache.hadoop.hdfs.web.TokenAspect$TokenManagementDelegator
Subsequently, I am unable to read any data from blob storage.
The little information I found online suggested that this is caused by a version conflict between Spark and Hadoop. The application runs with Spark 2.4 prebuilt for Hadoop 2.7.
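A quick way to confirm the mismatch is to print the Hadoop version that is actually on the classpath, both locally and on the cluster. A minimal check, assuming it is run from spark-shell or inside the application:

import org.apache.hadoop.util.VersionInfo

// Prints e.g. 2.7.3 for a local Spark 2.4 prebuilt for Hadoop 2.7,
// versus 3.1.x on the HDI 4.0 cluster.
println(s"Hadoop version on the classpath: ${VersionInfo.getVersion}")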
To fix this, I ssh into each head and worker node of the cluster and manually downgrade the Hadoop dependencies from 3.1.x to 2.7.3 to match the version in my local spark/jars folder. After doing this, I am able to deploy my application successfully. Downgrading the cluster from HDI 4.0 is not an option, as it is the only cluster version that can support Spark 2.4.
To summarize, could the issue be that I am using a Spark download prebuilt for Hadoop 2.7? Is there a better way to fix this conflict than manually downgrading the Hadoop versions on the cluster's nodes or changing the Spark version I am using?
Answer
After troubleshooting some of the methods I had previously attempted, I came across the following fix:
In my pom.xml, I excluded the hadoop-client dependency automatically imported by the spark-core jar. This dependency was version 2.6.5, which conflicted with the cluster's version of Hadoop. Instead, I imported the version I required.
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.version.major}</artifactId>
    <version>${spark.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
</dependency>
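For reference, the ${...} placeholders above resolve from the POM's <properties> section. A sketch of plausible values for this setup; the exact versions are assumptions based on the ones mentioned in the question:

<properties>
    <scala.version.major>2.11</scala.version.major>
    <spark.version>2.4.4</spark.version>
    <hadoop.version>2.7.3</hadoop.version>
</properties>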
After making this change, I encountered the error java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0. Further research revealed this was due to a problem with the Hadoop configuration on my local machine. Per this article's advice, I modified the winutils.exe version I had under C://winutils/bin to be the version I required and also added the corresponding hadoop.dll. After making these changes, I was able to successfully read data from blob storage as expected.
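For anyone hitting the same UnsatisfiedLinkError: Hadoop locates winutils.exe via the hadoop.home.dir system property (or the HADOOP_HOME environment variable), which must point at the directory containing bin/winutils.exe. A minimal sketch of setting it before the SparkSession is created, assuming the C://winutils layout described above:

import org.apache.spark.sql.SparkSession

// Must be set before any Hadoop/Spark class loads the native libraries.
// The directory should contain bin/winutils.exe and bin/hadoop.dll.
System.setProperty("hadoop.home.dir", "C:/winutils")

val spark = SparkSession.builder()
  .appName("blob-stream-test")
  .master("local[*]")
  .getOrCreate()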
TLDR: The issue was the auto-imported hadoop-client dependency, which was fixed by excluding it and adding the new winutils.exe and hadoop.dll under C://winutils/bin.
This meant I no longer needed to downgrade the Hadoop versions within the HDInsight cluster or change the Spark version I had downloaded.