Get files created in the last 5 minutes in Hadoop using a shell script
Problem description
I have the following entries in HDFS:
drwxrwx--- - root supergroup 0 2016-08-19 06:21 /tmp/logs/root/logs/application_1464962104018_1639064
drwxrwx--- - root supergroup 0 2016-08-19 06:21 /tmp/logs/root/logs/application_1464962104018_1639065
New files are continuously added to the /tmp/logs/root/logs/ directory.
I want to get the files created in the last five minutes, measured from the current time, and then copy them to my local machine.
How about this:
hdfs dfs -ls /tmp | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk 'BEGIN{ MIN=5; LAST=60*MIN; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; close(cmd); DIFF=NOW-WHEN; if(DIFF < LAST){ print $3 }}'
Explanation:
List all the files:
hdfs dfs -ls /tmp
Squeeze runs of spaces into a single space:
tr -s " "
Get the required columns:
cut -d' ' -f6-8
Remove rows that do not start with a date (such as the "Found N items" header):
grep "^[0-9]"
Process the remaining rows with awk:
awk
Initialize the time window LAST (in seconds) and the current time NOW:
MIN=5; LAST=60*MIN; "date +%s" | getline NOW
Build a command that converts the file's HDFS timestamp to epoch seconds:
cmd="date -d'\''"$1" "$2"'\'' +%s";
Execute the command to get the epoch value for the HDFS file:
cmd | getline WHEN;
Get the time difference:
DIFF=NOW-WHEN;
Print the path if the difference is smaller than the window:
if(DIFF < LAST){ print $3 }
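The steps above can also be folded into a single awk program. This is a minimal sketch rather than the exact command from the answer: the function name recent_paths is hypothetical, it assumes GNU date (for -d) and the usual eight-column layout of hdfs dfs -ls, and it closes each date command after reading from it so awk does not keep one pipe open per row. Keep in mind that the timestamp printed by hdfs dfs -ls is the modification time, with minute resolution.

```shell
#!/bin/sh
# recent_paths [MINUTES]
# Reads `hdfs dfs -ls` output on stdin and prints the paths whose
# timestamp falls within the last MINUTES minutes (default 5).
# Assumes GNU date, since `date -d` is used to parse the timestamp.
recent_paths() {
  awk -v min="${1:-5}" '
    BEGIN {
      window = 60 * min              # window length in seconds
      "date +%s" | getline now       # current time as epoch seconds
      close("date +%s")
    }
    # Data rows have the date in field 6; this skips the "Found N items" header.
    NF >= 8 && $6 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ {
      cmd = "date -d \"" $6 " " $7 "\" +%s"
      if ((cmd | getline when) > 0 && now - when < window)
        print $8                     # field 8 is the path
      close(cmd)                     # release the pipe before the next row
    }
  '
}
```

For example, `hdfs dfs -ls /tmp/logs/root/logs | recent_paths 5` would print the entries from the last five minutes. Paths containing spaces are not handled by this sketch.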
You just need to change the value of MIN
to suit your requirement (here it's 5 minutes).
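The question also asks for the matching files to be copied to the local machine, which the one-liner leaves out. One way to sketch that step is a plain shell loop (a different technique from the awk pipeline) that feeds each recent path to hdfs dfs -get. The function name copy_recent and its arguments are hypothetical, and GNU date is assumed for -d.

```shell
#!/bin/sh
# copy_recent SRC_DIR DEST_DIR [MINUTES]
# Copies entries of SRC_DIR (on HDFS) whose listed timestamp is within the
# last MINUTES minutes (default 5) into the local directory DEST_DIR.
# The body runs in a subshell, so its variables do not leak to the caller.
copy_recent() (
  src=$1
  dest=$2
  minutes=${3:-5}
  # Anything stamped after this epoch value counts as recent.
  cutoff=$(( $(date +%s) - 60 * minutes ))
  mkdir -p "$dest" || exit 1
  # Columns: permissions, replication, owner, group, size, date, time, path.
  hdfs dfs -ls "$src" |
  while read -r perms repl owner group size day time path; do
    # Skip the "Found N items" header and any row without a date.
    case "$day" in
      [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
      *) continue ;;
    esac
    when=$(date -d "$day $time" +%s) || continue
    if [ "$when" -gt "$cutoff" ]; then
      # Redirect stdin so the copy cannot consume the rest of the listing.
      hdfs dfs -get "$path" "$dest"/ < /dev/null
    fi
  done
)
```

A call such as `copy_recent /tmp/logs/root/logs ./recent_logs 5` could be run from cron every five minutes. Since the listing only has minute resolution, an entry close to the boundary may be picked up by two consecutive runs.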
HTH