Grep across multiple files in Hadoop Filesystem
Question
I am working with Hadoop and I need to find which of ~100 files in my Hadoop filesystem contain a certain string.
I can see the files I wish to search like this:
bash-3.00$ hadoop fs -ls /apps/mdhi-technology/b_dps/real-time
…which returns several entries like this:
-rw-r--r-- 3 b_dps mdhi-technology 1073741824 2012-07-18 22:50 /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_aa
-rw-r--r-- 3 b_dps mdhi-technology 1073741824 2012-07-18 22:50 /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_ab
How do I find which of these contains the string bcd4bc3e1380a56108f486a4fffbc8dc? Once I know, I can edit them manually.
Answer
This is a Hadoop "filesystem", not a POSIX one, so try this:
hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' |
while read -r f
do
  hadoop fs -cat "$f" | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo "$f"
done
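The loop above can be tried out locally without a cluster: the logic (`cat` a file, `grep -q` for the hash, print the name on a match) is unchanged; only `hadoop fs -cat` is replaced by plain `cat`, and the `hadoop fs -ls | awk` listing by a glob over a temporary directory created for this sketch.

```shell
# Local sketch of the serial search pattern. The temporary directory and
# file names here are stand-ins for the HDFS listing in the answer.
tmpdir=$(mktemp -d)
printf 'no match here\n'                    > "$tmpdir/file_aa"
printf 'bcd4bc3e1380a56108f486a4fffbc8dc\n' > "$tmpdir/file_ab"

matches=$(
  for f in "$tmpdir"/*; do
    # Same test as the HDFS version: print the name only if the hash occurs.
    cat "$f" | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo "$f"
  done
)
echo "$matches"   # only file_ab contains the hash
```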
This should work, but it is serial and so may be slow. If your cluster can take the heat, we can parallelize:
hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' |
  xargs -n 1 -I ^ -P 10 bash -c \
    "hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^"
Notice the -P 10 option to xargs: this is how many files we will download and search in parallel. Start low and increase the number until you saturate disk I/O or network bandwidth, whichever is relevant in your configuration.
EDIT: Given that you're on SunOS (which is slightly brain-dead), try this:
hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | while read f; do hadoop fs -cat "$f" | grep bcd4bc3e1380a56108f486a4fffbc8dc >/dev/null && echo "$f"; done