spark throws error when reading hive table


Problem description

I am trying to do `select * from db.abc` in Hive; this Hive table was loaded using Spark.

It does not work; it shows the error:

Error: java.io.IOException: java.lang.IllegalArgumentException: bucketId out of range: -1 (state=,code=0)

When I use the following properties, I am able to query Hive:

set hive.mapred.mode=nonstrict;
set hive.optimize.ppd=true;
set hive.optimize.index.filter=true;
set hive.tez.bucket.pruning=true;
set hive.explain.user=false; 
set hive.fetch.task.conversion=none;

Now, when I try to read the same Hive table db.abc using Spark, I receive the error below:

Clients can access this table only if they have the following capabilities: CONNECTORREAD,HIVEFULLACIDREAD,HIVEFULLACIDWRITE,HIVEMANAGESTATS,HIVECACHEINVALIDATE,CONNECTORWRITE. This table may be a Hive-managed ACID table, or require some other capability that Spark currently does not implement;
  at org.apache.spark.sql.catalyst.catalog.CatalogUtils$.throwIfNoAccess(ExternalCatalogUtils.scala:280)
  at org.apache.spark.sql.hive.HiveTranslationLayerCheck$$anonfun$apply$1.applyOrElse(HiveTranslationLayerStrategies.scala:105)
  at org.apache.spark.sql.hive.HiveTranslationLayerCheck$$anonfun$apply$1.applyOrElse(HiveTranslationLayerStrategies.scala:85)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
  at org.apache.spark.sql.hive.HiveTranslationLayerCheck.apply(HiveTranslationLayerStrategies.scala:85)
  at org.apache.spark.sql.hive.HiveTranslationLayerCheck.apply(HiveTranslationLayerStrategies.scala:83)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
  at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
  at scala.collection.immutable.List.foldLeft(List.scala:84)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:124)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:118)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:103)
  at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
  at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
  ... 49 elided

Do I need to add any properties in spark-submit or in the shell? Or what is the alternate way to read this Hive table using Spark?

Hive table sample format:

CREATE TABLE `hive`(
  `c_id` decimal(11,0),
  etc.........
)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES ( ... )
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://gjuyada/bbts/scl/raw'
TBLPROPERTIES (
  'bucketing_version'='2',
  'spark.sql.create.version'='2.3.2.3.1.0.0-78',
  'spark.sql.sources.provider'='orc',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[
    {"name":"Czz_ID","type":"decimal(11,0)","nullable":true,"metadata":{}},
    {"name":"DzzzC_CD","type":"string","nullable":true,"metadata":{}},
    {"name":"C0000_S_N","type":"decimal(11,0)","nullable":true,"metadata":{}},
    {"name":"P_ _NB","type":"decimal(11,0)","nullable":true,"metadata":{}},
    {"name":"C_YYYY","type":"string","nullable":true,"metadata":{}},
    {"name":"Cv_ID","type":"string","nullable":true,"metadata":{}},
    ...]}',
  'transactional'='true',
  'transient_lastDdlTime'='1574817059')

Answer

The issue is that you are trying to read a transactional table (`transactional = true`) into Spark.

Officially, Spark does not yet support Hive-ACID tables. The workaround is to take a full dump (or incremental dumps) of the ACID table into a regular Hive ORC/Parquet partitioned table, and then read the data using Spark.

There is an open Jira, SPARK-15348, to add support for reading Hive ACID tables.

  • If you run a major compaction on the ACID table (from Hive), then Spark is able to read only the base_XXX directories, not the delta directories; SPARK-16996 addresses this.

There are some workarounds to read ACID tables using SPARK-LLAP, as mentioned in this link.

I think that starting from HDP-3.X, the HiveWarehouseConnector (HWC) is able to support reading Hive ACID tables.
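
For reference, a read through HWC looks roughly like the minimal sketch below. This is an illustration, not a drop-in solution: it assumes the HWC jar is on the classpath and the HiveServer2/LLAP connection configs (e.g. `spark.sql.hive.hiveserver2.jdbc.url`) are already set for your cluster; `db.abc` is the table from the question.

    // Minimal HiveWarehouseConnector sketch (HDP 3.x).
    import com.hortonworks.hwc.HiveWarehouseSession

    // Build an HWC session on top of the existing SparkSession (`spark` in spark-shell).
    val hive = HiveWarehouseSession.session(spark).build()

    // HWC routes the read through HiveServer2/LLAP, which understands ACID tables,
    // instead of Spark's native Hive reader (which throws the error above).
    val df = hive.executeQuery("SELECT * FROM db.abc")
    df.show(10)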

You can create a snapshot of the transactional table as a non-transactional table and then read the data from that table.

    create table <non_trans> stored as orc as select * from <transactional_table>
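
Once the snapshot exists it is a plain (non-ACID) ORC table, so Spark's built-in Hive reader can load it; a quick sketch, where the snapshot name `db.abc_snapshot` is made up for illustration:

    // Read the non-transactional snapshot with Spark's native Hive support.
    val snapshot = spark.read.table("db.abc_snapshot") // hypothetical snapshot name
    snapshot.show(10)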

Update:

1. Create an external Hive table:

    CREATE EXTERNAL TABLE `<ext_tab_name>`(
      <col_name> <data_type> ....etc
    )
    STORED AS orc
    LOCATION '<path>';

2. Then overwrite the above external table with the existing transactional table's data:

 insert overwrite table <ext_tab_name> select * from <transactional_tab_name>;
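
After the overwrite finishes, the external table is an ordinary non-transactional ORC table, and the original failing query should now work from Spark. A sketch, with a hypothetical table name standing in for `<ext_tab_name>`:

    // Query the plain external ORC copy from Spark.
    val df = spark.sql("SELECT * FROM db.abc_ext") // hypothetical external table name
    df.printSchema()
    df.show(10)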
