Grep across multiple files in Hadoop Filesystem
Question
I am working with Hadoop and I need to find which of ~100 files in my Hadoop filesystem contain a certain string.
I can see the files I wish to search like this:
bash-3.00$ hadoop fs -ls /apps/mdhi-technology/b_dps/real-time
..which returns several entries like this:
-rw-r--r-- 3 b_dps mdhi-technology 1073741824 2012-07-18 22:50 /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_aa
-rw-r--r-- 3 b_dps mdhi-technology 1073741824 2012-07-18 22:50 /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_ab
How do I find which of these contains the string bcd4bc3e1380a56108f486a4fffbc8dc? Once I know, I can edit them manually.
Recommended answer
This is a hadoop "filesystem", not a POSIX one, so try this:
hadoop fs -ls /apps/mdhi-technology/b_dps/real-time | awk 'NF >= 8 {print $8}' | \
while read -r f
do
  # Stream each file out of HDFS; print its path if the string is found.
  hadoop fs -cat "$f" | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo "$f"
done
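The loop above can also be wrapped in a small function so the same logic can be reused for other strings. What follows is only a sketch under stated assumptions: the function name `hdfs_grep_l` and the `CAT_CMD` variable are invented for illustration and are not part of Hadoop. `CAT_CMD` defaults to `hadoop fs -cat`, and setting it to `cat` lets the function be tried on local files first.

```shell
# Hypothetical helper (not a Hadoop command): reads file paths on stdin and
# prints each path whose contents match the pattern given as $1.
# CAT_CMD is an assumed override hook; it defaults to "hadoop fs -cat".
hdfs_grep_l() {
  pattern=$1
  while read -r f
  do
    # Skip blank lines, such as the one awk emits for the "Found N items" header.
    [ -n "$f" ] || continue
    # </dev/null keeps the cat command from consuming the list of paths.
    # Note: grep treats the pattern as a regular expression.
    if ${CAT_CMD:-hadoop fs -cat} "$f" </dev/null | grep "$pattern" >/dev/null
    then
      echo "$f"
    fi
  done
}
```

It would be used in place of the loop, for example: `hadoop fs -ls /apps/mdhi-technology/b_dps/real-time | awk '{print $8}' | hdfs_grep_l bcd4bc3e1380a56108f486a4fffbc8dc`.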
This should work, but it is serial and so may be slow. If your cluster can take the heat, we can parallelize:
hadoop fs -ls /apps/mdhi-technology/b_dps/real-time | awk 'NF >= 8 {print $8}' | \
xargs -I ^ -P 10 bash -c \
"hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^"
Notice the -P 10 option to xargs: this is how many files we will download and search in parallel. Start low and increase the number until you saturate disk I/O or network bandwidth, whatever is relevant in your configuration.
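To make the degree of parallelism easy to vary while looking for that saturation point, the xargs pipeline can be wrapped in a function that takes the job count as an argument. This is a sketch, not a tested recipe for any particular cluster: the name `parallel_grep_l`, the `CAT_CMD` override, and the default of 4 jobs are all assumptions made for illustration, and it relies on an xargs that supports `-P` (GNU and BSD xargs do). Unlike the pipeline above, it hands the path to the inner shell as `$1` instead of pasting it into the command string, so unusual characters in a path are not reinterpreted by the shell.

```shell
# Hypothetical helper: reads file paths on stdin and prints those that
# match the pattern, searching up to NJOBS files at once.
# Usage: <paths on stdin> | parallel_grep_l PATTERN [NJOBS]
# CAT_CMD is an assumed override hook; it defaults to "hadoop fs -cat".
parallel_grep_l() {
  pattern=$1
  njobs=${2:-4}   # assumed default; raise it gradually while watching I/O
  # PATTERN and CAT_CMD are passed to the inner shells through the environment.
  PATTERN=$pattern CAT_CMD=${CAT_CMD:-hadoop fs -cat} \
    xargs -P "$njobs" -I{} sh -c '
      if $CAT_CMD "$1" | grep "$PATTERN" >/dev/null; then echo "$1"; fi
    ' sh {}
}
```

For example, `hadoop fs -ls /apps/mdhi-technology/b_dps/real-time | awk 'NF >= 8 {print $8}' | parallel_grep_l bcd4bc3e1380a56108f486a4fffbc8dc 10` would mirror the `-P 10` run. Matching paths are printed as each search finishes, so their order can differ between runs.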
EDIT: Given that you're on SunOS (which is slightly brain-dead) try this:
hadoop fs -ls /apps/mdhi-technology/b_dps/real-time | awk '{print $8}' | while read f; do hadoop fs -cat "$f" | grep bcd4bc3e1380a56108f486a4fffbc8dc >/dev/null && echo "$f"; done