Data retention in Hadoop HDFS


Question


We have a Hadoop cluster with over 100TB data in HDFS. I want to delete data older than 13 weeks in certain Hive tables.

Are there any tools or way I can achieve this?

Thank you

Solution

To delete data older than a certain time frame, you have a few options.

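First, if the Hive table is partitioned by date, you could simply DROP the old partitions within Hive and remove their underlying directories (for a managed table, Hive normally removes the directories as part of the drop; for an external table you have to remove them yourself).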
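As a rough illustration of the partition approach, here is a minimal Python sketch that turns a list of partition names into DROP statements. It assumes a single-level partition column named dt with values in yyyy-MM-dd format, and partition strings in the form printed by SHOW PARTITIONS; the table name logs.events and the helper name drop_statements are made up for the example. It only builds the statements, so you can review them before running them in Hive.

```python
from datetime import date, datetime, timedelta


def drop_statements(table, partitions, today, weeks=13, column="dt"):
    """Build ALTER TABLE ... DROP PARTITION statements for expired partitions.

    `partitions` holds strings as printed by SHOW PARTITIONS,
    e.g. "dt=2024-01-01" (assumed: one partition column, yyyy-MM-dd values).
    """
    cutoff = today - timedelta(weeks=weeks)
    statements = []
    for spec in partitions:
        name, _, value = spec.partition("=")
        if name != column:
            continue  # not the date column we were asked to look at
        try:
            day = datetime.strptime(value, "%Y-%m-%d").date()
        except ValueError:
            continue  # skip values that are not plain yyyy-MM-dd dates
        if day < cutoff:
            statements.append(
                f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({column}='{value}');"
            )
    return statements


if __name__ == "__main__":
    # Hypothetical partition list; in practice this comes from SHOW PARTITIONS.
    sample = ["dt=2024-01-01", "dt=2024-03-01", "dt=2024-06-01"]
    for stmt in drop_statements("logs.events", sample, today=date(2024, 6, 15)):
        print(stmt)
```

Partitions with more than one level (for example dt=.../hour=...) are skipped by this sketch rather than guessed at, so they would need their own handling.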

A second option would be to run an INSERT into a new table, filtering out the old data using a datestamp column (if available). This is likely not a good option, since it rewrites all the data you keep and you have 100TB of it.

A third option would be to recursively list the data directories for your Hive tables: hadoop fs -lsr /path/to/hive/table (on newer releases -lsr is deprecated in favor of hadoop fs -ls -R). This will output a list of the files and their modification dates. You can take this output, extract the date, and compare it against the time frame you want to keep. If a file is older than what you want to keep, run hadoop fs -rm <file> on it.
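The listing-based approach could be sketched in Python roughly as follows. This assumes the usual eight-column listing layout (permissions, replication, owner, group, size, date, time, path) with dates printed as yyyy-MM-dd HH:mm; the function name expired_files and the sample paths are invented for illustration. It only reports candidate files and does not delete anything.

```python
from datetime import datetime, timedelta


def expired_files(listing, now, weeks=13):
    """Return paths of files whose listed timestamp is older than the cutoff.

    `listing` is the text printed by a recursive `hadoop fs -ls` of a directory
    (assumed columns: perms, replication, owner, group, size, date, time, path).
    """
    cutoff = now - timedelta(weeks=weeks)
    expired = []
    for line in listing.splitlines():
        # Split into at most 8 fields so a path containing spaces stays whole.
        fields = line.split(None, 7)
        if len(fields) < 8 or fields[0].startswith("d"):
            continue  # skip directories and lines that are not file entries
        try:
            modified = datetime.strptime(f"{fields[5]} {fields[6]}", "%Y-%m-%d %H:%M")
        except ValueError:
            continue  # unexpected date format, leave the file alone
        if modified < cutoff:
            expired.append(fields[7])
    return expired


if __name__ == "__main__":
    # Hypothetical listing output, used only to exercise the parser.
    sample = (
        "drwxr-xr-x   - hive hadoop          0 2024-01-05 10:00 /warehouse/t/dt=2024-01-05\n"
        "-rw-r--r--   3 hive hadoop       1024 2024-01-05 10:00 /warehouse/t/dt=2024-01-05/part-00000\n"
        "-rw-r--r--   3 hive hadoop       2048 2024-06-01 09:30 /warehouse/t/dt=2024-06-01/part-00000\n"
    )
    for path in expired_files(sample, now=datetime(2024, 6, 15)):
        print(path)
```

Each reported path can then be passed to hadoop fs -rm. Printing the list first and checking it by hand is a cheap safeguard before deleting anything on a 100TB cluster.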

A fourth option would be to grab a copy of the FSImage: curl --silent "http://<active namenode>:50070/getimage?getimage=1&txid=latest" -o hdfs.image. Next, turn it into a text file: hadoop oiv -i hdfs.image -o hdfs.txt. The text file will contain a text representation of HDFS, the same as what hadoop fs -ls ... would return, so you can filter it by date in the same way without putting listing load on the NameNode.
