从Apache Spark连接到Hive [英] Connec to Hive from Apache Spark
问题描述
我有一个简单的程序,正在独立Cloudera VM上运行.我已经在Hive中创建了一个托管表,我想在Apache spark中读取该表,但是尚未建立与hive的初始连接.请告知.
I have a simple program that I'm running on Standalone Cloudera VM. I have created a managed table in Hive , which I want to read in Apache spark, but the initial connection to hive is not being established. Please advise.
我正在IntelliJ中运行此程序,我已经将hive-site.xml从我的/etc/hive/conf复制到了/etc/spark/conf,即使spark-job没有连接到Hive metastore >
I'm running this program in IntelliJ, I have copied hive-site.xml from my /etc/hive/conf to /etc/spark/conf, even then the spark-job is not connecting to Hive metastore
public static void main(String[] args) throws AnalysisException {
String master = "local[*]";
SparkSession sparkSession = SparkSession
.builder().appName(ConnectToHive.class.getName())
.config("spark.sql.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse")
.enableHiveSupport()
.master(master).getOrCreate();
SparkContext context = sparkSession.sparkContext();
context.setLogLevel("ERROR");
SQLContext sqlCtx = sparkSession.sqlContext();
HiveContext hiveContext = new HiveContext(sparkSession);
hiveContext.setConf("hive.metastore.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse");
hiveContext.sql("SHOW DATABASES").show();
hiveContext.sql("SHOW TABLES").show();
sparkSession.close();
}
输出如下,期望在其中看到"Employee table",以便我查询.由于我在Standa-alone上运行,因此配置单元metastore在本地mySQL服务器中.
The output is as below, where is expect to see "Employee table" , so that I can query. Since I'm running on Standa-alone , hive metastore is in Local mySQL server.
+------------+
|databaseName|
+------------+
| default|
+------------+
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist = true是Hive Metastore的配置
jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true is the configuration for Hive metastore
hive> show databases;
OK
default
sxm
temp
Time taken: 0.019 seconds, Fetched: 3 row(s)
hive> use default;
OK
Time taken: 0.015 seconds
hive> show tables;
OK
employee
Time taken: 0.014 seconds, Fetched: 1 row(s)
hive> describe formatted employee;
OK
# col_name data_type comment
id string
firstname string
lastname string
addresses array<struct<street:string,city:string,state:string>>
# Detailed Table Information
Database: default
Owner: cloudera
CreateTime: Tue Jul 25 06:33:01 PDT 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://quickstart.cloudera:8020/user/hive/warehouse/employee
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1500989581
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.07 seconds, Fetched: 29 row(s)
hive>
添加了Spark日志
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/07/25 11:38:30 INFO SparkContext: Running Spark version 2.1.0
17/07/25 11:38:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/25 11:38:30 INFO SecurityManager: Changing view acls to: cloudera
17/07/25 11:38:30 INFO SecurityManager: Changing modify acls to: cloudera
17/07/25 11:38:30 INFO SecurityManager: Changing view acls groups to:
17/07/25 11:38:30 INFO SecurityManager: Changing modify acls groups to:
17/07/25 11:38:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cloudera); groups with view permissions: Set(); users with modify permissions: Set(cloudera); groups with modify permissions: Set()
17/07/25 11:38:31 INFO Utils: Successfully started service 'sparkDriver' on port 55232.
17/07/25 11:38:31 INFO SparkEnv: Registering MapOutputTracker
17/07/25 11:38:31 INFO SparkEnv: Registering BlockManagerMaster
17/07/25 11:38:31 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/07/25 11:38:31 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/07/25 11:38:31 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-eb1e611f-1b88-487f-b600-3da1ff8353db
17/07/25 11:38:31 INFO MemoryStore: MemoryStore started with capacity 1909.8 MB
17/07/25 11:38:31 INFO SparkEnv: Registering OutputCommitCoordinator
17/07/25 11:38:31 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/07/25 11:38:31 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.2.15:4040
17/07/25 11:38:31 INFO Executor: Starting executor ID driver on host localhost
17/07/25 11:38:31 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41433.
17/07/25 11:38:31 INFO NettyBlockTransferService: Server created on 10.0.2.15:41433
17/07/25 11:38:31 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/07/25 11:38:31 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.2.15, 41433, None)
17/07/25 11:38:31 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.2.15:41433 with 1909.8 MB RAM, BlockManagerId(driver, 10.0.2.15, 41433, None)
17/07/25 11:38:31 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.2.15, 41433, None)
17/07/25 11:38:31 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.2.15, 41433, None)
17/07/25 11:38:32 INFO SharedState: Warehouse path is 'file:/home/cloudera/works/JsonHive/spark-warehouse/'.
17/07/25 11:38:32 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
17/07/25 11:38:32 INFO deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
17/07/25 11:38:32 INFO deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
17/07/25 11:38:32 INFO deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
17/07/25 11:38:32 INFO deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
17/07/25 11:38:32 INFO deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
17/07/25 11:38:32 INFO deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
17/07/25 11:38:32 INFO deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
17/07/25 11:38:32 INFO deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
17/07/25 11:38:32 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
17/07/25 11:38:32 INFO ObjectStore: ObjectStore, initialize called
17/07/25 11:38:32 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
17/07/25 11:38:32 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
17/07/25 11:38:34 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:35 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
17/07/25 11:38:35 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
17/07/25 11:38:35 INFO ObjectStore: Initialized ObjectStore
17/07/25 11:38:36 INFO HiveMetaStore: Added admin role in metastore
17/07/25 11:38:36 INFO HiveMetaStore: Added public role in metastore
17/07/25 11:38:36 INFO HiveMetaStore: No user is added in admin role, since config is empty
17/07/25 11:38:36 INFO HiveMetaStore: 0: get_all_databases
17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_all_databases
17/07/25 11:38:36 INFO HiveMetaStore: 0: get_functions: db=default pat=*
17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_functions: db=default pat=*
17/07/25 11:38:36 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:36 INFO SessionState: Created local directory: /tmp/76258222-81db-4ac1-9566-1d8f05c3ecba_resources
17/07/25 11:38:36 INFO SessionState: Created HDFS directory: /tmp/hive/cloudera/76258222-81db-4ac1-9566-1d8f05c3ecba
17/07/25 11:38:36 INFO SessionState: Created local directory: /tmp/cloudera/76258222-81db-4ac1-9566-1d8f05c3ecba
17/07/25 11:38:36 INFO SessionState: Created HDFS directory: /tmp/hive/cloudera/76258222-81db-4ac1-9566-1d8f05c3ecba/_tmp_space.db
17/07/25 11:38:36 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is file:/home/cloudera/works/JsonHive/spark-warehouse/
17/07/25 11:38:36 INFO HiveMetaStore: 0: get_database: default
17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_database: default
17/07/25 11:38:36 INFO HiveMetaStore: 0: get_database: global_temp
17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_database: global_temp
17/07/25 11:38:36 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
+------------+
|databaseName|
+------------+
| default|
+------------+
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
Process finished with exit code 0
更新
/usr/lib/hive/conf/hive-site.xml不在类路径中,因此它没有读取表,在将其添加到类路径中后,它工作得很好...由于我是从IntelliJ运行的,所以这个问题..在生产中,spark-conf文件夹将链接到hive-site.xml ...
/usr/lib/hive/conf/hive-site.xml was not in the classpath so it was not reading the tables, after adding it in the classpath it worked fine ... Since I was running from IntelliJ I have this problem .. in production the spark-conf folder will have link to hive-site.xml ...
推荐答案
17/07/25 11:38:35 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
这表明您未连接到远程配置单元元存储(已设置为MySQL),并且XML文件未正确放在类路径中.
This is a hint that you're not connected to the remote hive metastore (that you've set as MySQL), and the XML file is not correctly on your classpath.
在创建SparkSession之前,您可以在不使用XML的情况下以编程方式进行操作
You can do it programmatically without XML before you make a SparkSession
System.setProperty("hive.metastore.uris", "thrift://METASTORE:9083");
这篇关于从Apache Spark连接到Hive的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!