How to Access Hive via Python?
Question
https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python appears to be outdated.
When I add this to /etc/profile:
export PYTHONPATH=$PYTHONPATH:/usr/lib/hive/lib/py
I can then do the imports as listed in the link, with the exception of from hive import ThriftHive, which actually needs to be:
from hive_service import ThriftHive
Next, the port in the example was 10000, which caused the program to hang when I tried it. The default Hive Thrift port is 9083, which stopped the hanging.
So I set it up like this:
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
from hive_service import ThriftHive  # the corrected import described above

try:
    # Open a buffered binary-protocol connection to the Thrift service
    transport = TSocket.TSocket('<node-with-metastore>', 9083)
    transport = TTransport.TBufferedTransport(transport)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = ThriftHive.Client(protocol)
    transport.open()
    client.execute("CREATE TABLE test(c1 int)")
    transport.close()
except Thrift.TException as tx:
    print('%s' % (tx.message))
I received the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 68, in execute
    self.recv_execute()
  File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 84, in recv_execute
    raise x
thrift.Thrift.TApplicationException: Invalid method name: 'execute'
But inspecting the ThriftHive.py file reveals that the method execute does exist within the Client class.
How can I use Python to access Hive?
Recommended answer
I believe the easiest way is to use PyHive.
To install it, you'll need these libraries:
pip install sasl
pip install thrift
pip install thrift-sasl
pip install PyHive
Please note that although you install the library as PyHive, you import the module as pyhive, all lower-case.
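Because the pip names and the import names differ (thrift-sasl is imported as thrift_sasl, PyHive as pyhive), a quick way to confirm the install worked is to check that each module can be found. This is only a minimal sketch using the standard library; `missing_modules` and `REQUIRED_MODULES` are hypothetical names chosen for illustration, not part of PyHive.

```python
import importlib.util

# Import names (not pip names) of the packages installed above.
# thrift-sasl is imported as thrift_sasl, and PyHive as pyhive.
REQUIRED_MODULES = ["sasl", "thrift", "thrift_sasl", "pyhive"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If sasl shows up as missing, the system SASL headers are a likely cause.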
If you're on Linux, you may need to install SASL separately before running the above. Install the package libsasl2-dev using apt-get, yum, or whatever package manager your distribution uses. For Windows, there are some options on GNU.org, where you can download a binary installer. On a Mac, SASL should be available if you've installed the Xcode developer tools (xcode-select --install in Terminal).
After installation, you can connect to Hive like this:
from pyhive import hive
conn = hive.Connection(host="YOUR_HIVE_HOST", port=PORT, username="YOU")
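If the connection call hangs or fails, it can help to first check whether anything is listening on the port at all. PyHive talks to HiveServer2, which in many default setups listens on 10000, while 9083 is usually the metastore's Thrift port; that difference would explain an "Invalid method name" error like the one in the question. The sketch below uses only the standard library. `port_is_open` is a hypothetical helper, and the host and port in the usage comment are assumptions about your cluster.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


# Example (host and port are placeholders for your own cluster):
# if not port_is_open("YOUR_HIVE_HOST", 10000):
#     print("Nothing is listening there; check the HiveServer2 address.")
```

An open port only shows that some service is listening, not that it is HiveServer2, so treat this as a first diagnostic step rather than proof.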
Now that you have the Hive connection, you have options for how to use it. You can just straight-up query:
cursor = conn.cursor()
cursor.execute("SELECT cool_stuff FROM hive_table")
for result in cursor.fetchall():
    use_result(result)
...or use the connection to make a pandas DataFrame:
import pandas as pd
df = pd.read_sql("SELECT cool_stuff FROM hive_table", conn)
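For large result sets, the cursor query shown earlier can also be consumed in batches instead of with fetchall(), so the whole result never sits in memory at once. The following is a sketch that relies only on the standard Python DB-API cursor methods (execute, description, fetchmany), which PyHive cursors provide. `iter_rows` and its `batch_size` default are hypothetical choices for illustration.

```python
def iter_rows(conn, query, batch_size=1000):
    """Yield each result row as a dict, fetching batch_size rows at a time."""
    cursor = conn.cursor()
    try:
        cursor.execute(query)
        # description holds one tuple per column; item 0 is the column name.
        columns = [col[0] for col in cursor.description]
        while True:
            rows = cursor.fetchmany(batch_size)
            if not rows:
                break
            for row in rows:
                yield dict(zip(columns, row))
    finally:
        cursor.close()


# Example with the connection created above:
# for row in iter_rows(conn, "SELECT cool_stuff FROM hive_table"):
#     use_result(row)
```

Because the helper only uses DB-API calls, it can be tried against any DB-API connection (for example an in-memory sqlite3 database) before pointing it at Hive.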