Using Python to insert multiple rows into a Hive table

Problem Description

Hive is a data warehouse designed for querying and aggregating large datasets that reside on HDFS.

The standard INSERT INTO syntax performs poorly because:

  1. Each statement requires a Map/Reduce process to be executed.
  2. Each statement results in a new file being added to HDFS - over time this leads to very poor performance when reading from the table.

With that said, there is now a Streaming API for Hive / HCatalog, as detailed here.

I need to insert data into Hive at high speed, using Python. I am aware of the pyhive and pyhs2 libraries, but neither of them appears to make use of the Streaming API.
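
For reference, the per-row path those libraries do offer looks roughly like the sketch below (the host, table, and column values are placeholders, not from any real setup); every execute() call is a separate Hive statement, so it incurs exactly the per-statement Map/Reduce and small-file overhead described above.

from pyhive import hive

# Illustrative only: slow per-row inserts via pyhive. Host, table and column
# values are placeholders; each execute() runs one full Hive statement.
conn = hive.Connection(host='hive-server.example.com', port=10000, username='etl')
cursor = conn.cursor()

rows = [('2017-01-01', 'device-a', 42.0),
        ('2017-01-01', 'device-b', 17.5)]
for day, device, reading in rows:
    cursor.execute(
        "INSERT INTO TABLE sensor_readings VALUES ('%s', '%s', %f)"
        % (day, device, reading)
    )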

Has anyone successfully managed to get Python to insert many rows into Hive using the Streaming API, and how was this done?

I look forward to your insights!

Solution

Hive users can stream a table through a script to transform the data:

ADD FILE replace-nan-with-zeros.py;

SELECT
  TRANSFORM (...)
  USING 'python replace-nan-with-zeros.py'
  AS (...)
FROM some_table;

Here is a simple Python script:

#!/usr/bin/env python
import sys

kFirstColumns = 7

def main(argv):
    for line in sys.stdin:
        line = line.strip()
        inputs = line.split('\t')

        # replace NaNs with zeros
        outputs = []
        columnIndex = 1
        for value in inputs:
            newValue = value
            if columnIndex > kFirstColumns:
                newValue = value.replace('NaN', '0.0')
            outputs.append(newValue)
            columnIndex = columnIndex + 1

        print '\t'.join(outputs)

if __name__ == "__main__":
    main(sys.argv[1:])
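
The transform can be sanity-checked outside Hive by piping a tab-separated row through the script. Below is a minimal sketch, assuming the file is in the working directory and a Python 2 interpreter is available as python2 (the script uses the Python 2 print statement):

import subprocess

# Hypothetical local test: feed one 9-column tab-separated row on STDIN and
# print what replace-nan-with-zeros.py writes to STDOUT. Only columns after
# the 7th (kFirstColumns) should have 'NaN' replaced by '0.0'.
sample = "a\tb\tc\td\te\tf\tg\tNaN\t1.5\n"
result = subprocess.run(
    ["python2", "replace-nan-with-zeros.py"],
    input=sample,
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)   # expected: a  b  c  d  e  f  g  0.0  1.5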

Hive and Python

Python can be used as a UDF from Hive through the HiveQL TRANSFORM statement. For example, the following HiveQL invokes a Python script stored in the streaming.py file.

Linux-based HDInsight

add file wasb:///streaming.py;

SELECT TRANSFORM (clientid, devicemake, devicemodel)
  USING 'streaming.py' AS
  (clientid string, phoneLable string, phoneHash string)
FROM hivesampletable
ORDER BY clientid LIMIT 50;

Windows-based HDInsight

add file wasb:///streaming.py;

SELECT TRANSFORM (clientid, devicemake, devicemodel)
  USING 'D:\Python27\python.exe streaming.py' AS
  (clientid string, phoneLable string, phoneHash string)
FROM hivesampletable
ORDER BY clientid LIMIT 50;

Here's what this example does:

1. The add file statement at the beginning of the file adds the streaming.py file to the distributed cache, so it's accessible by all nodes in the cluster.

2. The SELECT TRANSFORM ... USING statement selects data from the hivesampletable, and passes clientid, devicemake, and devicemodel to the streaming.py script.

3. The AS clause describes the fields returned from streaming.py.

Here's the streaming.py file used by the HiveQL example.

#!/usr/bin/env python

import sys
import string
import hashlib

while True:
  line = sys.stdin.readline()
  if not line:
    break

  line = string.strip(line, "\n ")
  clientid, devicemake, devicemodel = string.split(line, "\t")
  phone_label = devicemake + ' ' + devicemodel
  print "\t".join([clientid, phone_label, hashlib.md5(phone_label).hexdigest()])

Since we are using streaming, this script has to do the following:

1. Read data from STDIN. This is accomplished by using sys.stdin.readline() in this example.

2. The trailing newline character is removed using string.strip(line, "\n "), since we just want the text data and not the end of line indicator.

3. When doing stream processing, a single line contains all the values with a tab character between each value. So string.split(line, "\t") can be used to split the input at each tab, returning just the fields.

4. When processing is complete, the output must be written to STDOUT as a single line, with a tab between each field. This is accomplished by using print "\t".join([clientid, phone_label, hashlib.md5(phone_label).hexdigest()]).

5. This all occurs within a while loop that repeats until no line is read, at which point break exits the loop and the script terminates.

Beyond that, the script just concatenates the input values for devicemake and devicemodel, and calculates a hash of the concatenated value. Pretty simple, but it describes the basics of how any Python script invoked from Hive should function: Loop, read input until there is no more, break each line of input apart at the tabs, process, write a single line of tab delimited output.
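
One portability note: the module-level string.strip and string.split functions used above exist only in Python 2. If the cluster nodes run Python 3 instead, a roughly equivalent version of streaming.py might look like this sketch (same column names, offered only as an assumption-laden illustration):

#!/usr/bin/env python3
# Python 3 sketch of the same transform: read tab-separated rows from STDIN and
# emit clientid, a combined make/model label, and the MD5 hash of that label.
import sys
import hashlib

for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    clientid, devicemake, devicemodel = line.split("\t")
    phone_label = devicemake + ' ' + devicemodel
    phone_hash = hashlib.md5(phone_label.encode("utf-8")).hexdigest()
    print("\t".join([clientid, phone_label, phone_hash]))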
