Query HIVE table in pyspark
Problem Description
I am using CDH 5.5.
I have a table created in the Hive default database, and I am able to query it from the Hive CLI.
Output
hive> use default;
OK
Time taken: 0.582 seconds
hive> show tables;
OK
bank
Time taken: 0.341 seconds, Fetched: 1 row(s)
hive> select count(*) from bank;
OK
542
Time taken: 64.961 seconds, Fetched: 1 row(s)
However, I am unable to query the table from pyspark, as it does not recognize the table.
from pyspark.context import SparkContext
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.sql("use default")
DataFrame[result: string]
sqlContext.sql("show tables").show()
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
+---------+-----------+
sqlContext.sql("FROM bank SELECT count(*)")
16/03/16 20:12:13 INFO parse.ParseDriver: Parsing command: FROM bank SELECT count(*)
16/03/16 20:12:13 INFO parse.ParseDriver: Parse Completed
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/context.py", line 552, in sql
return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 40, in deco
raise AnalysisException(s.split(': ', 1)[1])
**pyspark.sql.utils.AnalysisException: no such table bank; line 1 pos 5**
New Error
>>> from pyspark.sql import HiveContext
>>> hive_context = HiveContext(sc)
>>> bank = hive_context.table("default.bank")
16/03/22 18:33:30 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/03/22 18:33:30 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/03/22 18:33:44 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:44 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:50 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/context.py", line 565, in table
return DataFrame(self._ssql_ctx.table(tableName), self)
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 36, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o22.table.
: org.apache.spark.sql.catalyst.analysis.NoSuchTableException
at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:123)
at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:123)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:123)
at org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:406)
at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:422)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:203)
at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:422)
at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Thanks.
Answer
We cannot pass the Hive table name directly to the HiveContext sql method, since it doesn't understand Hive table names. One way to read a Hive table in the pyspark shell is:
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
bank = hive_context.table("default.bank")
bank.show()
To run SQL on the Hive table: first, we need to register the DataFrame obtained from reading the Hive table as a temporary table; then we can run the SQL query against it.
bank.registerTempTable("bank_temp")
hive_context.sql("select * from bank_temp").show()
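For reference, on Spark 2.x and later the same flow uses SparkSession instead of HiveContext. A minimal sketch, assuming a Hive-enabled Spark build and the same default.bank table from the question:

from pyspark.sql import SparkSession

# Build a session with Hive support so tables in the Hive metastore are visible
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

bank = spark.table("default.bank")
# createOrReplaceTempView supersedes registerTempTable in Spark 2.x
bank.createOrReplaceTempView("bank_temp")
spark.sql("select * from bank_temp").show()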