GoogleHadoopFileSystem cannot be cast to hadoop FileSystem?


Problem Description


The original question was about trying to deploy Spark 1.4 on Google Cloud. After downloading and setting

SPARK_HADOOP2_TARBALL_URI='gs://my_bucket/my-images/spark-1.4.1-bin-hadoop2.6.tgz'


the deployment with bdutil was fine; however, when trying to call SqlContext.parquetFile("gs://my_bucket/some_data.parquet"), it runs into the following exception:

 java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2595)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:354)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.hive.metastore.Warehouse.getFs(Warehouse.java:112)
at org.apache.hadoop.hive.metastore.Warehouse.getDnsPath(Warehouse.java:144)
at org.apache.hadoop.hive.metastore.Warehouse.getWhRoot(Warehouse.java:159)


And what confused me is that GoogleHadoopFileSystem should be a subclass of org.apache.hadoop.fs.FileSystem, and I even verified it in the same spark-shell instance:

scala> var gfs = new com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem()
gfs: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem@46f105c

scala> gfs.isInstanceOf[org.apache.hadoop.fs.FileSystem]
res3: Boolean = true

scala> gfs.asInstanceOf[org.apache.hadoop.fs.FileSystem]
res4: org.apache.hadoop.fs.FileSystem = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem@46f105c


Did I miss anything, any workaround? Thanks in advance!


UPDATE: this is my bdutil (version 1.3.1) setting for deployment:

import_env hadoop2_env.sh
import_env extensions/spark/spark_env.sh
CONFIGBUCKET="my_conf_bucket"
PROJECT="my_proj"
GCE_IMAGE='debian-7-backports'
GCE_MACHINE_TYPE='n1-highmem-4'
GCE_ZONE='us-central1-f'
GCE_NETWORK='my-network'
GCE_MASTER_MACHINE_TYPE='n1-standard-2'
PREEMPTIBLE_FRACTION=1.0
PREFIX='my-hadoop'
NUM_WORKERS=8
USE_ATTACHED_PDS=true
WORKER_ATTACHED_PDS_SIZE_GB=200
MASTER_ATTACHED_PD_SIZE_GB=200
HADOOP_TARBALL_URI="gs://hadoop-dist/hadoop-2.6.0.tar.gz"
SPARK_MODE="yarn-client"
SPARK_HADOOP2_TARBALL_URI='gs://my_conf_bucket/my-images/spark-1.4.1-bin-hadoop2.6.tgz'

Recommended Answer

Short Answer

Indeed it's related to IsolatedClientLoader; we've tracked down the root cause and verified a fix. I filed https://issues.apache.org/jira/browse/SPARK-9206 to track this issue, and successfully built a clean Spark tarball from my fork with a simple fix: https://github.com/apache/spark/pull/7549

There are a few short-term options:

1. Use Spark 1.3.1 for now.

2. In your bdutil deployment, use HDFS as the default filesystem (--default_fs=hdfs); you can still specify gs:// paths directly in your jobs, it's just that HDFS will be used for intermediate data and staging files. There are some minor incompatibilities with using raw Hive in this mode, though.

3. Use a raw val sqlContext = new org.apache.spark.sql.SQLContext(sc) instead of HiveContext if you don't need HiveContext features (see the sketch after this list).

4. git clone https://github.com/dennishuo/spark and run ./make-distribution.sh --name my-custom-spark --tgz --skip-java-test -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver to get a fresh tarball that you can specify in your bdutil spark_env.sh.
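To illustrate option 3, here is a minimal sketch, assuming the bucket and path from the question (they are placeholders); a plain SQLContext never touches the Hive metastore code path that fails above:

import org.apache.spark.sql.SQLContext

// Option 3 sketch: a plain SQLContext avoids HiveContext's isolated classloader entirely.
val sqlContext = new SQLContext(sc)

// parquetFile still works in Spark 1.4; sqlContext.read.parquet(...) is the newer equivalent.
val df = sqlContext.parquetFile("gs://my_bucket/some_data.parquet")
df.printSchema()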


Long Answer


We've verified that it only manifests when fs.default.name and fs.defaultFS are set to a gs:// path regardless of whether trying to load a path from parquetFile("gs://...") or parquetFile("hdfs://..."), and when fs.default.name and fs.defaultFS are set to an HDFS path, loading data from both HDFS and from GCS works fine. This is also specific to Spark 1.4+ currently, and is not present in Spark 1.3.1 or older.
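As a quick sanity check, you can inspect which default filesystem the running cluster resolves to from the same spark-shell; this is just a sketch, and the comments only describe what a GCS-default bdutil deployment would typically show:

// Print the default filesystem from the SparkContext's Hadoop configuration.
// fs.defaultFS is the Hadoop 2.x property; fs.default.name is its deprecated alias.
val hconf = sc.hadoopConfiguration
println(hconf.get("fs.defaultFS"))    // a gs:// URI here is the configuration that hits the bug
println(hconf.get("fs.default.name")) // normally mirrors fs.defaultFS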


The regression appears to have been introduced in https://github.com/apache/spark/commit/9ac8393663d759860c67799e000ec072ced76493 which actually fixes a prior related classloading issue, SPARK-8368. While the fix itself is correct for normal cases, there's a method, IsolatedClientLoader.isSharedClass, used to determine which classloader to use, and it interacts with the aforementioned commit to break GoogleHadoopFileSystem classloading.

The following lines in that file include everything under com.google.* as a "shared class" because of Guava and possibly protobuf dependencies which are indeed loaded as shared libraries, but unfortunately GoogleHadoopFileSystem should be loaded as a "hive class" in this case, just like org.apache.hadoop.hdfs.DistributedFileSystem. We just happen to unluckily share the com.google.* package namespace.
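For reference, the prefix check in IsolatedClientLoader.isSharedClass has roughly the following shape (a paraphrase, not a verbatim copy of the linked source); the com.google prefix is what sweeps GoogleHadoopFileSystem into the shared classloader, and the fix in the pull request above essentially narrows that prefix so that com.google.cloud.* classes are no longer treated as shared:

// Rough paraphrase of the shared-class predicate in Spark 1.4's IsolatedClientLoader;
// see the linked source for the exact prefix list.
def isSharedClass(name: String): Boolean =
  name.contains("slf4j") ||
  name.contains("log4j") ||
  name.startsWith("org.apache.spark.") ||
  name.startsWith("scala.") ||
  name.startsWith("com.google") ||  // also matches com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
  name.startsWith("java.lang.") ||
  name.startsWith("java.net")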


This can be verified by adding the following line to ${SPARK_INSTALL}/conf/log4j.properties:

log4j.logger.org.apache.spark.sql.hive.client=DEBUG

And the output shows:

...
15/07/20 20:59:14 DEBUG IsolatedClientLoader: hive class: org.apache.hadoop.hdfs.DistributedFileSystem - jar:file:/home/hadoop/spark-install/lib/spark-assembly-1.4.1-hadoop2.6.0.jar!/org/apache/hadoop/hdfs/DistributedFileSystem.class
...
15/07/20 20:59:14 DEBUG IsolatedClientLoader: shared class: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
java.lang.RuntimeException: java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem
