Connect to Hive from Apache Spark


Problem Description


I have a simple program that I'm running on a standalone Cloudera VM. I have created a managed table in Hive, which I want to read in Apache Spark, but the initial connection to Hive is not being established. Please advise.


I'm running this program in IntelliJ, and I have copied hive-site.xml from my /etc/hive/conf to /etc/spark/conf, but even then the Spark job does not connect to the Hive metastore.

import org.apache.spark.SparkContext;
import org.apache.spark.sql.AnalysisException;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.hive.HiveContext;

public class ConnectToHive {

    public static void main(String[] args) throws AnalysisException {
        String master = "local[*]";

        // Build a Hive-enabled session pointing at the HDFS warehouse
        SparkSession sparkSession = SparkSession
                .builder().appName(ConnectToHive.class.getName())
                .config("spark.sql.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse")
                .enableHiveSupport()
                .master(master).getOrCreate();

        SparkContext context = sparkSession.sparkContext();
        context.setLogLevel("ERROR");

        SQLContext sqlCtx = sparkSession.sqlContext();

        // HiveContext is deprecated in Spark 2.x; kept here as originally written
        HiveContext hiveContext = new HiveContext(sparkSession);
        hiveContext.setConf("hive.metastore.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse");

        hiveContext.sql("SHOW DATABASES").show();
        hiveContext.sql("SHOW TABLES").show();

        sparkSession.close();
    }
}


The output is as below, where I expect to see the "employee" table so that I can query it. Since I'm running standalone, the Hive metastore is in the local MySQL server.

 +------------+
 |databaseName|
 +------------+
 |     default|
 +------------+

 +--------+---------+-----------+
 |database|tableName|isTemporary|
 +--------+---------+-----------+
 +--------+---------+-----------+


jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true is the configuration for the Hive metastore:

 hive> show databases;
 OK
 default
 sxm
 temp
 Time taken: 0.019 seconds, Fetched: 3 row(s)
 hive> use default;
 OK
 Time taken: 0.015 seconds
 hive> show tables;
 OK
 employee
 Time taken: 0.014 seconds, Fetched: 1 row(s)
 hive> describe formatted employee;
 OK
 # col_name             data_type               comment             

 id                     string                                      
 firstname              string                                      
 lastname               string                                      
 addresses              array<struct<street:string,city:string,state:string>>                       

 # Detailed Table Information        
 Database:              default                  
 Owner:                 cloudera                 
 CreateTime:            Tue Jul 25 06:33:01 PDT 2017     
 LastAccessTime:        UNKNOWN                  
 Protect Mode:          None                     
 Retention:             0                        
 Location:              hdfs://quickstart.cloudera:8020/user/hive/warehouse/employee     
 Table Type:            MANAGED_TABLE            
 Table Parameters:       
    transient_lastDdlTime   1500989581          

 # Storage Information       
 SerDe Library:         org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe  
 InputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat    
 OutputFormat:          org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat   
 Compressed:            No                       
 Num Buckets:           -1                       
 Bucket Columns:        []                       
 Sort Columns:          []                       
 Storage Desc Params:        
    serialization.format    1                   
 Time taken: 0.07 seconds, Fetched: 29 row(s)
 hive> 

Spark logs added:

 log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
 log4j:WARN Please initialize the log4j system properly.
 log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
 Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
 17/07/25 11:38:30 INFO SparkContext: Running Spark version 2.1.0
 17/07/25 11:38:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 17/07/25 11:38:30 INFO SecurityManager: Changing view acls to: cloudera
 17/07/25 11:38:30 INFO SecurityManager: Changing modify acls to: cloudera
 17/07/25 11:38:30 INFO SecurityManager: Changing view acls groups to: 
 17/07/25 11:38:30 INFO SecurityManager: Changing modify acls groups to: 
 17/07/25 11:38:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(cloudera); groups with view permissions: Set(); users  with modify permissions: Set(cloudera); groups with modify permissions: Set()
 17/07/25 11:38:31 INFO Utils: Successfully started service 'sparkDriver' on port 55232.
 17/07/25 11:38:31 INFO SparkEnv: Registering MapOutputTracker
 17/07/25 11:38:31 INFO SparkEnv: Registering BlockManagerMaster
 17/07/25 11:38:31 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
 17/07/25 11:38:31 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
 17/07/25 11:38:31 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-eb1e611f-1b88-487f-b600-3da1ff8353db
 17/07/25 11:38:31 INFO MemoryStore: MemoryStore started with capacity 1909.8 MB
 17/07/25 11:38:31 INFO SparkEnv: Registering OutputCommitCoordinator
 17/07/25 11:38:31 INFO Utils: Successfully started service 'SparkUI' on port 4040.
 17/07/25 11:38:31 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.2.15:4040
 17/07/25 11:38:31 INFO Executor: Starting executor ID driver on host localhost
 17/07/25 11:38:31 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41433.
 17/07/25 11:38:31 INFO NettyBlockTransferService: Server created on 10.0.2.15:41433
 17/07/25 11:38:31 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
 17/07/25 11:38:31 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.2.15, 41433, None)
 17/07/25 11:38:31 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.2.15:41433 with 1909.8 MB RAM, BlockManagerId(driver, 10.0.2.15, 41433, None)
 17/07/25 11:38:31 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.2.15, 41433, None)
 17/07/25 11:38:31 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.2.15, 41433, None)
 17/07/25 11:38:32 INFO SharedState: Warehouse path is 'file:/home/cloudera/works/JsonHive/spark-warehouse/'.
 17/07/25 11:38:32 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
 17/07/25 11:38:32 INFO deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
 17/07/25 11:38:32 INFO deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
 17/07/25 11:38:32 INFO deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
 17/07/25 11:38:32 INFO deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
 17/07/25 11:38:32 INFO deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
 17/07/25 11:38:32 INFO deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
 17/07/25 11:38:32 INFO deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
 17/07/25 11:38:32 INFO deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
 17/07/25 11:38:32 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
 17/07/25 11:38:32 INFO ObjectStore: ObjectStore, initialize called
 17/07/25 11:38:32 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
 17/07/25 11:38:32 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
 17/07/25 11:38:34 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
 17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
 17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
 17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
 17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
 17/07/25 11:38:35 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
 17/07/25 11:38:35 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
 17/07/25 11:38:35 INFO ObjectStore: Initialized ObjectStore
 17/07/25 11:38:36 INFO HiveMetaStore: Added admin role in metastore
 17/07/25 11:38:36 INFO HiveMetaStore: Added public role in metastore
 17/07/25 11:38:36 INFO HiveMetaStore: No user is added in admin role, since config is empty
 17/07/25 11:38:36 INFO HiveMetaStore: 0: get_all_databases
 17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr  cmd=get_all_databases   
 17/07/25 11:38:36 INFO HiveMetaStore: 0: get_functions: db=default pat=*
 17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr  cmd=get_functions: db=default pat=* 
 17/07/25 11:38:36 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
 17/07/25 11:38:36 INFO SessionState: Created local directory: /tmp/76258222-81db-4ac1-9566-1d8f05c3ecba_resources
 17/07/25 11:38:36 INFO SessionState: Created HDFS directory: /tmp/hive/cloudera/76258222-81db-4ac1-9566-1d8f05c3ecba
 17/07/25 11:38:36 INFO SessionState: Created local directory: /tmp/cloudera/76258222-81db-4ac1-9566-1d8f05c3ecba
 17/07/25 11:38:36 INFO SessionState: Created HDFS directory: /tmp/hive/cloudera/76258222-81db-4ac1-9566-1d8f05c3ecba/_tmp_space.db
 17/07/25 11:38:36 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is file:/home/cloudera/works/JsonHive/spark-warehouse/
 17/07/25 11:38:36 INFO HiveMetaStore: 0: get_database: default
 17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr  cmd=get_database: default   
 17/07/25 11:38:36 INFO HiveMetaStore: 0: get_database: global_temp
 17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr  cmd=get_database: global_temp   
 17/07/25 11:38:36 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
 +------------+
 |databaseName|
 +------------+
 |     default|
 +------------+

 +--------+---------+-----------+
 |database|tableName|isTemporary|
 +--------+---------+-----------+
 +--------+---------+-----------+


 Process finished with exit code 0

UPDATE


/usr/lib/hive/conf/hive-site.xml was not on the classpath, so the tables were not being read; after adding it to the classpath it worked fine. I had this problem because I was running from IntelliJ; in production the spark-conf folder will have a link to hive-site.xml.
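For reference, the relevant hive-site.xml entries look roughly like the sketch below. The metastore host and Thrift port are assumptions (9083 is the default metastore port); the JDBC URL is the one quoted earlier in the question:

```xml
<configuration>
  <!-- Thrift URI of the metastore service; 9083 is the default port -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://quickstart.cloudera:9083</value>
  </property>
  <!-- Backing database for the metastore, as configured above -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true</value>
  </property>
</configuration>
```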

Answer

 17/07/25 11:38:35 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY


This is a hint that you're not connected to the remote Hive metastore (which you've set up as MySQL), and that the XML file is not correctly on your classpath.


You can set it programmatically, without the XML file, before you create the SparkSession:

System.setProperty("hive.metastore.uris", "thrift://METASTORE:9083");
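The same idea can be sketched directly on the session builder from the question's program. The metastore host and port here are assumptions (the VM's hostname with the default Thrift port 9083), and this fragment only connects if a metastore service is actually running there:

```java
// Sketch: point the session at the remote metastore explicitly,
// instead of letting Spark fall back to a local Derby database.
SparkSession sparkSession = SparkSession
        .builder().appName("ConnectToHive")
        .config("hive.metastore.uris", "thrift://quickstart.cloudera:9083")
        .config("spark.sql.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse")
        .enableHiveSupport()
        .master("local[*]")
        .getOrCreate();
```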

See also: How to connect to a Hive metastore programmatically in SparkSQL?
