Hive create table with inputs from nested sub-directories
Problem description
I have data in Avro format in HDFS in file paths like: /data/logs/[foldername]/[filename].avro
. I want to create a Hive table over all these log files, i.e. all files of the form /data/logs/*/*
. (They're all based on the same Avro schema.)
I'm running the below query with flag mapred.input.dir.recursive=true
:
CREATE EXTERNAL TABLE default.testtable
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://.../data/*/*'
TBLPROPERTIES (
'avro.schema.url'='hdfs://.../schema.avsc')
The table ends up being empty unless I change LOCATION
to something less nested, i.e. to 'hdfs://.../data/[foldername]/'
with a specific foldername. With such a less nested path for LOCATION,
this worked without problems.
I'd like to be able to source data from all these different [foldername] folders. How do I make the recursive input selection go further in my nested directories?
Use these Hive settings to enable recursive directory traversal:
set hive.mapred.supports.subdirectories=TRUE;
set mapred.input.dir.recursive=TRUE;
Create the external table and specify the root directory as its location:
LOCATION 'hdfs://.../data'
You will then be able to query data from the table location and all of its subdirectories.
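Putting the two pieces together, the full session might look like the following sketch. The paths and schema URL are placeholders carried over from the question (the elided `hdfs://.../` prefixes are left as-is):

```sql
-- Enable recursive traversal of subdirectories (session-level settings).
SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;

-- Point LOCATION at the root directory rather than a wildcard pattern;
-- with the settings above, Hive picks up files in all nested subdirectories.
CREATE EXTERNAL TABLE default.testtable
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://.../data'
TBLPROPERTIES (
  'avro.schema.url'='hdfs://.../schema.avsc');
```

Note that `mapred.input.dir.recursive` is the older property name; on newer Hadoop releases it is deprecated in favor of `mapreduce.input.fileinputformat.input.dir.recursive`, so setting both may be safest.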