Hive create table with inputs from nested sub-directories
Problem description
I have data in Avro format in HDFS in file paths like: /data/logs/[foldername]/[filename].avro. I want to create a Hive table over all these log files, i.e. all files of the form /data/logs/*/*. (They're all based on the same Avro schema.)
I'm running the below query with the flag mapred.input.dir.recursive=true:
CREATE EXTERNAL TABLE default.testtable
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://.../data/*/*'
TBLPROPERTIES (
'avro.schema.url'='hdfs://.../schema.avsc')
The table ends up being empty unless I change LOCATION to be less nested, i.e. to be 'hdfs://.../data/[foldername]/' with a specific foldername. This worked with no problem for a less nested LOCATION path.
I'd like to be able to source data from all these different [foldername] folders. How do I make the recursive input selection go further in my nested directories?
Answer
Use these Hive settings to enable recursive directory traversal:
set hive.mapred.supports.subdirectories=TRUE;
set mapred.input.dir.recursive=TRUE;
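If you want these settings to apply to every session rather than typing them per query, the same properties can also be set in hive-site.xml. This is a sketch using the property names from the answer above; note that on newer Hadoop versions the recursive-input property may instead be named mapreduce.input.fileinputformat.input.dir.recursive:

```xml
<!-- hive-site.xml: enable recursive directory traversal globally -->
<property>
  <name>hive.mapred.supports.subdirectories</name>
  <value>true</value>
</property>
<property>
  <name>mapred.input.dir.recursive</name>
  <value>true</value>
</property>
```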
Create the external table and specify the root directory as its location:
LOCATION 'hdfs://.../data'
You will then be able to query data from the table location and all of its subdirectories.
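Putting the answer together, the question's DDL with only the LOCATION changed — a non-globbed root directory instead of the 'hdfs://.../data/*/*' wildcard — would look like the sketch below. The elided hdfs://... paths are kept as in the question:

```sql
-- Enable recursive directory traversal for this session
SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;

-- Same Avro SerDe and input/output formats as in the question;
-- LOCATION now points at the root directory, with no wildcards,
-- so Hive picks up files in all nested [foldername] subdirectories
CREATE EXTERNAL TABLE default.testtable
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://.../data'
TBLPROPERTIES (
  'avro.schema.url'='hdfs://.../schema.avsc');
```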