Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn cluster


Question

I'm attempting to run a pyspark script on BigInsights on Cloud 4.2 Enterprise that accesses a Hive table.

First I create the hive table:

[biadmin@bi4c-xxxxx-mastermanager ~]$ hive
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 2.147 seconds
hive> LOAD DATA LOCAL INPATH '/usr/iop/4.2.0.0/hive/doc/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loading data to table default.pokes
Table default.pokes stats: [numFiles=1, numRows=0, totalSize=5812, rawDataSize=0]
OK
Time taken: 0.49 seconds
hive> 

Then I create a simple pyspark script:

[biadmin@bi4c-xxxxxx-mastermanager ~]$ cat test_pokes.py
from pyspark import SparkContext

sc = SparkContext()

from pyspark.sql import HiveContext
hc = HiveContext(sc)

pokesRdd = hc.sql('select * from pokes')
print( pokesRdd.collect() )

I attempt to execute with:

[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit \
    --master yarn-cluster \
    --deploy-mode cluster \
    --jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar, \
           /usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar, \
           /usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
    test_pokes.py

However, I encounter the error:

Traceback (most recent call last):
  File "test_pokes.py", line 8, in <module>
    pokesRdd = hc.sql('select * from pokes')
  File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/pyspark.zip/pyspark/sql/context.py", line 580, in sql
  File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/pyspark.zip/pyspark/sql/utils.py", line 51, in deco
pyspark.sql.utils.AnalysisException: u'Table not found: pokes; line 1 pos 14'
End of LogType:stdout

If I run spark-submit standalone, I can see the table exists ok:

[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit test_pokes.py
…
…
16/12/21 13:09:13 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 18962 bytes result sent to driver
16/12/21 13:09:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 168 ms on localhost (1/1)
16/12/21 13:09:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/12/21 13:09:13 INFO DAGScheduler: ResultStage 0 (collect at /home/biadmin/test_pokes.py:9) finished in 0.179 s
16/12/21 13:09:13 INFO DAGScheduler: Job 0 finished: collect at /home/biadmin/test_pokes.py:9, took 0.236558 s
[Row(foo=238, bar=u'val_238'), Row(foo=86, bar=u'val_86'), Row(foo=311, bar=u'val_311')
…
…

See my previous question related to this issue: hive spark yarn-cluster job fails with: "ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory"

This question is similar to this other question: Spark can access Hive table from pyspark but not from spark-submit. However, unlike that question I am using HiveContext.


Update: see here for the final solution: https://stackoverflow.com/a/41272260/1033422

Solution

This is because the spark-submit job is unable to find the hive-site.xml, so it cannot connect to the Hive metastore. Please add --files /usr/iop/4.2.0.0/hive/conf/hive-site.xml to your spark-submit command.
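
For reference, a sketch of what the full submit command might look like with that flag added, reusing the jar and config paths from the question (adjust them for your own cluster; note that the --jars value is a single comma-separated list with no spaces):

[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit \
    --master yarn-cluster \
    --deploy-mode cluster \
    --files /usr/iop/4.2.0.0/hive/conf/hive-site.xml \
    --jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
    test_pokes.py

With hive-site.xml shipped to the YARN containers via --files, the HiveContext created by the driver can locate the Hive metastore and resolve the default.pokes table.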
