How to load data from CSV to Hive external table (Avro format)?


Question

I am trying to build a Beeline script that loads a CSV file stored in HDFS into an external table in Hive. First I create an empty external table:

CREATE EXTERNAL TABLE IF NOT EXISTS db.table
(column string)
STORED AS AVRO
LOCATION '/foo/bar'

After that I checked whether the table had been created, and it had. I had already loaded a CSV file into HDFS with:

hdfs dfs -put test.csv /temp/raw_csv

Now I want to load (or insert) this CSV data into the external Hive table, but stored in Avro format (as defined in the create script). I tried it with this line:

LOAD DATA INPATH '/temp/raw_csv/test.csv' OVERWRITE INTO TABLE db.table

This line runs without an error; see the command output here:

. . . . . . . . . . . . . . . . . . . . . . .>
. . . . . . . . . . . . . . . . . . . . . . .>
INFO : Compiling command(queryId=hive_20201201130808_4120504b-6799-4833-83e7-5fa8ff8c6ca8): LOAD DATA INPATH '/temp/raw_csv/test.csv' OVERWRITE INTO TABLE db.table
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hive_20201201130808_4120504b-6799-4833-83e7-5fa8ff8c6ca8); Time taken: 0.427 seconds
INFO : Executing command(queryId=hive_20201201130808_4120504b-6799-4833-83e7-5fa8ff8c6ca8): LOAD DATA INPATH '/temp/raw_csv/test.csv' OVERWRITE INTO TABLE db.table
INFO : Starting task [Stage-0:MOVE] in serial mode
INFO : Loading data to table db.table from hdfs://temp/raw_csv/test.csv
INFO : Starting task [Stage-1:STATS] in serial mode
INFO : Table db.table stats: [numFiles=1, totalSize=45981179]
INFO : Completed executing command(queryId=hive_20201201130808_4120504b-6799-4833-83e7-5fa8ff8c6ca8); Time taken: 0.376 seconds
INFO : OK
No rows affected (0.87 seconds)

But when I query the table with the following line, I get an error:

select * FROM db.test;

INFO : Compiling command(queryId=hive_20201201131414_79187a87-c5e6-4b54-aecc-54c15d9a4bfa): select * FROM db.test
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:column, type:string, comment:null)
INFO : Completed compiling command(queryId=hive_20201201131414_79187a87-c5e6-4b54-aecc-54c15d9a4bfa); Time taken: 0.243 seconds
INFO : Executing command(queryId=hive_20201201131414_79187a87-c5e6-4b54-aecc-54c15d9a4bfa): select * FROM db.test
INFO : Completed executing command(queryId=hive_20201201131414_79187a87-c5e6-4b54-aecc-54c15d9a4bfa); Time taken: 0.006 seconds
INFO : OK
Error: java.io.IOException: java.io.IOException: Not a data file. (state=,code=0)

Is it possible to load data from CSV into a Hive Avro table with a Beeline command? And/or what would be a better way to load this data? I normally load 1-100 million rows of data.

Recommended answer

It does not work that way. The LOAD command does not transform data; it just moves the file into the table location.
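A minimal Python sketch of why the SELECT fails, assuming the error comes from the Avro reader's header check: Avro object container files begin with the four magic bytes 'O', 'b', 'j', 0x01, and a CSV file that LOAD merely moved into the table location does not. The function names here are illustrative and not part of Hive or Avro.

```python
# Avro object container files start with this 4-byte magic marker.
AVRO_MAGIC = b"Obj\x01"


def looks_like_avro(first_bytes: bytes) -> bool:
    """Return True if the leading bytes carry the Avro container magic."""
    return first_bytes[:4] == AVRO_MAGIC


def file_looks_like_avro(path: str) -> bool:
    """Check a local file by reading only its first four bytes."""
    with open(path, "rb") as fh:
        return looks_like_avro(fh.read(4))


if __name__ == "__main__":
    # A CSV file is plain text, so the check fails for it.
    print(looks_like_avro(b"id,name\n1,foo\n"))
    # Bytes that begin with the magic marker pass the check.
    print(looks_like_avro(AVRO_MAGIC + b"rest-of-header"))
```

To inspect a file that already sits in HDFS, its first bytes could be fetched with something like hdfs dfs -cat piped through head -c 4 and passed to looks_like_avro.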

If you want to convert from CSV to Avro, follow these steps:

  1. Create the CSV table.
  2. Put the file directly into the table location using hdfs dfs -put, or use

     LOAD DATA LOCAL INPATH 'local/path/to/csv' OVERWRITE INTO TABLE db.csv_table

  3. Create the Avro table.
  4. Use Hive to load the data from the CSV table into the Avro table:

     insert overwrite table avro_table select ... from csv_table
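Since the question asks for a Beeline script, the four steps above can be sketched as a small Python helper that renders such a script. This is only an illustration under stated assumptions: the helper name, the column col1, and the table names are hypothetical; the CSV is assumed to be plain comma-delimited text without quoted fields; and the CSV table is pointed at the directory that already holds the file, which covers step 2 without a separate LOAD.

```python
def render_csv_to_avro_script(csv_table, avro_table, columns,
                              csv_location, avro_location):
    """Render a HiveQL script for the CSV -> Avro conversion steps.

    columns is a list of (name, hive_type) pairs shared by both tables.
    All identifiers are inserted verbatim, so pass trusted values only.
    """
    col_defs = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    col_names = ", ".join(name for name, _ in columns)
    return f"""-- Steps 1 and 2: text table over the directory that holds the CSV
CREATE EXTERNAL TABLE IF NOT EXISTS {csv_table} (
  {col_defs}
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '{csv_location}';

-- Step 3: Avro table with the same columns
CREATE EXTERNAL TABLE IF NOT EXISTS {avro_table} (
  {col_defs}
)
STORED AS AVRO
LOCATION '{avro_location}';

-- Step 4: Hive reads via the text SerDe and writes via the Avro SerDe
INSERT OVERWRITE TABLE {avro_table}
SELECT {col_names} FROM {csv_table};
"""


if __name__ == "__main__":
    print(render_csv_to_avro_script(
        "db.csv_table", "db.avro_table",
        [("col1", "STRING")],          # hypothetical single column
        "/temp/raw_csv", "/foo/bar",   # paths taken from the question
    ))
```

The printed script could be saved to a file and run with beeline -f. If the CSV contains quoted fields or a header row, the text table definition would need adjusting, which this sketch does not attempt.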

The SerDe is responsible for reading and writing data files. It is used when you select from or insert into a table that was created with that SerDe. LOAD knows nothing about the file schema and does not use the SerDe. In the last step (4), Hive reads the CSV table and inserts the same data into the Avro table, using the SerDes specified in the two tables' DDL to deserialize from CSV and serialize to Avro.

Alternatively, you can convert the CSV file to an Avro file using the csv2avro tool or some other tool, then put the result into the Avro table's location.

If you do not need to convert from CSV to Avro and only want to load the data into a table so that you can query it, all you need is to create the CSV table and put the data files into its location (steps 1 and 2).
