GoogleHadoopFileSystem cannot be cast to hadoop FileSystem?


Problem Description


The original question was about trying to deploy Spark 1.4 on Google Cloud. After downloading and setting

SPARK_HADOOP2_TARBALL_URI='gs://my_bucket/my-images/spark-1.4.1-bin-hadoop2.6.tgz'


the deployment with bdutil was fine; however, when trying to call SqlContext.parquetFile("gs://my_bucket/some_data.parquet"), it runs into the following exception:

 java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2595)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:354)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.hive.metastore.Warehouse.getFs(Warehouse.java:112)
at org.apache.hadoop.hive.metastore.Warehouse.getDnsPath(Warehouse.java:144)
at org.apache.hadoop.hive.metastore.Warehouse.getWhRoot(Warehouse.java:159)


And what confused me is that GoogleHadoopFileSystem should be a subclass of org.apache.hadoop.fs.FileSystem, and I even verified it in the same spark-shell instance:

scala> var gfs = new com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem()
gfs: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem@46f105c

scala> gfs.isInstanceOf[org.apache.hadoop.fs.FileSystem]
res3: Boolean = true

scala> gfs.asInstanceOf[org.apache.hadoop.fs.FileSystem]
res4: org.apache.hadoop.fs.FileSystem = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem@46f105c


Did I miss anything, any workaround? Thanks in advance!


UPDATE: this is my bdutil (version 1.3.1) setting for deployment:

import_env hadoop2_env.sh
import_env extensions/spark/spark_env.sh
CONFIGBUCKET="my_conf_bucket"
PROJECT="my_proj"
GCE_IMAGE='debian-7-backports'
GCE_MACHINE_TYPE='n1-highmem-4'
GCE_ZONE='us-central1-f'
GCE_NETWORK='my-network'
GCE_MASTER_MACHINE_TYPE='n1-standard-2'
PREEMPTIBLE_FRACTION=1.0
PREFIX='my-hadoop'
NUM_WORKERS=8
USE_ATTACHED_PDS=true
WORKER_ATTACHED_PDS_SIZE_GB=200
MASTER_ATTACHED_PD_SIZE_GB=200
HADOOP_TARBALL_URI="gs://hadoop-dist/hadoop-2.6.0.tar.gz"
SPARK_MODE="yarn-client"
SPARK_HADOOP2_TARBALL_URI='gs://my_conf_bucket/my-images/spark-1.4.1-bin-hadoop2.6.tgz'

Recommended Answer

Short Answer

Indeed it's related to IsolatedClientLoader; we've tracked down the root cause and verified a fix. I filed https://issues.apache.org/jira/browse/SPARK-9206 to track this issue, and successfully built a clean Spark tarball from my fork with a simple fix: https://github.com/apache/spark/pull/7549

There are a few short-term options:

1. Use Spark 1.3.1 for now.

2. In your bdutil deployment, use HDFS as the default filesystem (--default_fs=hdfs); you can still specify gs:// paths directly in your jobs, it's just that HDFS will be used for intermediate data and staging files. There are some minor incompatibilities with using raw Hive in this mode, though.

3. Use a raw val sqlContext = new org.apache.spark.sql.SQLContext(sc) instead of HiveContext if you don't need HiveContext features (see the sketch after this list).

4. git clone https://github.com/dennishuo/spark and run ./make-distribution.sh --name my-custom-spark --tgz --skip-java-test -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver to get a fresh tarball that you can specify in your bdutil spark_env.sh.
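To illustrate option 3, here is a minimal sketch, assuming the bucket and path from the question (they are placeholders); a plain SQLContext never touches the Hive metastore code path that fails above:

import org.apache.spark.sql.SQLContext

// Option 3 sketch: a plain SQLContext avoids HiveContext's isolated classloader entirely.
val sqlContext = new SQLContext(sc)

// parquetFile still works in Spark 1.4; sqlContext.read.parquet(...) is the newer equivalent.
val df = sqlContext.parquetFile("gs://my_bucket/some_data.parquet")
df.printSchema()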


Long Answer


We've verified that it only manifests when fs.default.name and fs.defaultFS are set to a gs:// path regardless of whether trying to load a path from parquetFile("gs://...") or parquetFile("hdfs://..."), and when fs.default.name and fs.defaultFS are set to an HDFS path, loading data from both HDFS and from GCS works fine. This is also specific to Spark 1.4+ currently, and is not present in Spark 1.3.1 or older.
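As a quick sanity check, you can inspect which default filesystem the running cluster resolves to from the same spark-shell; this is just a sketch, and the comments only describe what a GCS-default bdutil deployment would typically show:

// Print the default filesystem from the SparkContext's Hadoop configuration.
// fs.defaultFS is the Hadoop 2.x property; fs.default.name is its deprecated alias.
val hconf = sc.hadoopConfiguration
println(hconf.get("fs.defaultFS"))    // a gs:// URI here is the configuration that hits the bug
println(hconf.get("fs.default.name")) // normally mirrors fs.defaultFS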


The regression appears to have been introduced in https://github.com/apache/spark/commit/9ac8393663d759860c67799e000ec072ced76493 which actually fixes a prior related classloading issue, SPARK-8368. While the fix itself is correct for normal cases, there's a method, IsolatedClientLoader.isSharedClass, used to determine which classloader to use, and it interacts with the aforementioned commit to break GoogleHadoopFileSystem classloading.

The following lines in that file include everything under com.google.* as a "shared class" because of Guava and possibly protobuf dependencies which are indeed loaded as shared libraries, but unfortunately GoogleHadoopFileSystem should be loaded as a "hive class" in this case, just like org.apache.hadoop.hdfs.DistributedFileSystem. We just happen to unluckily share the com.google.* package namespace.
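For reference, the prefix check in IsolatedClientLoader.isSharedClass has roughly the following shape (a paraphrase, not a verbatim copy of the linked source); the com.google prefix is what sweeps GoogleHadoopFileSystem into the shared classloader, and the fix in the pull request above essentially narrows that prefix so that com.google.cloud.* classes are no longer treated as shared:

// Rough paraphrase of the shared-class predicate in Spark 1.4's IsolatedClientLoader;
// see the linked source for the exact prefix list.
def isSharedClass(name: String): Boolean =
  name.contains("slf4j") ||
  name.contains("log4j") ||
  name.startsWith("org.apache.spark.") ||
  name.startsWith("scala.") ||
  name.startsWith("com.google") ||  // also matches com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
  name.startsWith("java.lang.") ||
  name.startsWith("java.net")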


This can be verified by adding the following line to ${SPARK_INSTALL}/conf/log4j.properties:

log4j.logger.org.apache.spark.sql.hive.client=DEBUG

And the output shows:

...
15/07/20 20:59:14 DEBUG IsolatedClientLoader: hive class: org.apache.hadoop.hdfs.DistributedFileSystem - jar:file:/home/hadoop/spark-install/lib/spark-assembly-1.4.1-hadoop2.6.0.jar!/org/apache/hadoop/hdfs/DistributedFileSystem.class
...
15/07/20 20:59:14 DEBUG IsolatedClientLoader: shared class: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
java.lang.RuntimeException: java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem
