Can a Hive table automatically update when the underlying directory is changed?


Problem description





If I build a Hive table on top of some S3 (or HDFS) directory like so:

CREATE EXTERNAL TABLE newtable (name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://location/subdir/';

When I add files to that S3 location, the Hive table doesn't automatically update. The new data is only included if I create a new Hive table on that location. Is there a way to build a Hive table (maybe using partitions) so that whenever new files are added to the underlying directory, the Hive table automatically shows that data (without having to recreate the Hive table)?

Solution

On HDFS, every file in the table location is scanned each time the table is queried, as @Dudu Markovitz pointed out, and files in HDFS are immediately consistent. On S3, files are immediately consistent after creation and eventually consistent after a delete or overwrite. So when you add new files to the table's S3 folder, they are immediately accessible when you query the Hive table. A problem can arise only if you rewrite existing files: rewritten files are not immediately consistent, only eventually consistent; see http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel. (That is the consistency model S3 documented when this answer was written; Amazon announced strong read-after-write consistency for S3 in December 2020.) There are a few ways to avoid the eventual-consistency problem, such as writing each time to a newly created partition based on a timestamp, or dropping and re-creating the table with a new location based on a timestamp or some run ID. The idea is to create new files each time instead of overwriting existing ones. Also have a look at this: https://github.com/andrewgaul/are-we-consistent-yet
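As a rough sketch of the "new partition per run" idea, something like the following HiveQL could work. The table name `newtable_by_run`, the partition column `run_id`, the timestamp value, and the sub-directory layout are all hypothetical and only illustrate the pattern. Keep in mind that with a partitioned table, a new partition directory is normally not visible until the partition has been registered in the metastore, so the writer has to add it explicitly (or run a repair).

```sql
-- Hypothetical partitioned variant of the table from the question.
CREATE EXTERNAL TABLE newtable_by_run (name STRING)
PARTITIONED BY (run_id STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://location/subdir/';

-- Each load writes its files to a fresh directory (never overwriting old files),
-- then registers that directory as a new partition. The run_id value is an example.
ALTER TABLE newtable_by_run
  ADD IF NOT EXISTS PARTITION (run_id = '20240101T000000')
  LOCATION 's3a://location/subdir/run_id=20240101T000000/';

-- Alternative: let Hive discover directories that follow the run_id=... naming.
MSCK REPAIR TABLE newtable_by_run;
```

For the unpartitioned table in the question, no such step is needed: files dropped directly into the table location are picked up on the next query.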

Statistics can also cause problems when you query the table after adding files; see here: https://stackoverflow.com/a/39914232/2700344
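A minimal sketch of how the statistics issue might be worked around, assuming the symptom is an aggregate such as `COUNT(*)` being answered from stale metastore statistics instead of from the files. Whether this applies depends on your Hive version and configuration.

```sql
-- Make Hive compute simple aggregates from the data rather than from stored statistics.
SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM newtable;

-- Or refresh the stored statistics after new files have been added.
ANALYZE TABLE newtable COMPUTE STATISTICS;
```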
