Grep across multiple files in Hadoop Filesystem


Problem description

I am working with Hadoop and I need to find which of ~100 files in my Hadoop filesystem contain a certain string.

I can see the files I wish to search like this:

bash-3.00$ hadoop fs -ls /apps/mdhi-technology/b_dps/real-time

...which returns several entries like this:

-rw-r--r--   3 b_dps mdhi-technology 1073741824 2012-07-18 22:50 /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_aa
-rw-r--r--   3 b_dps mdhi-technology 1073741824 2012-07-18 22:50 /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_ab

How do I find which of these contains the string bcd4bc3e1380a56108f486a4fffbc8dc? Once I know, I can edit them manually.

Recommended answer

This is a Hadoop "filesystem", not a POSIX one, so try this:

hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | \
while read f
do
  hadoop fs -cat $f | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo $f
done

This should work, but it is serial and so may be slow: awk '{print $8}' picks the path (the eighth field) out of the hadoop fs -ls output, and each file is then streamed through grep one at a time. If your cluster can take the heat, we can parallelize:

hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | \
  xargs -n 1 -I ^ -P 10 bash -c \
  "hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^"

Notice the -P 10 option to xargs: this is how many files we will download and search in parallel. Start low and increase the number until you saturate disk I/O or network bandwidth, whichever is relevant in your configuration.
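
A rough way to choose the -P value is to time the whole pipeline at a few settings and keep the largest value that still improves wall-clock time. The loop below is only a sketch; the candidate values 2, 5, 10 and 20 are arbitrary:

# Sketch: time the parallel search at a few -P values and compare wall-clock times.
for p in 2 5 10 20; do
  echo "trying -P $p"
  time ( hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | \
    xargs -n 1 -I ^ -P "$p" bash -c \
    "hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^" )
done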

EDIT: Given that you're on SunOS (which is slightly brain-dead), try this; SunOS's stock grep may not support the -q flag, so grep's output is redirected to /dev/null instead:

hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | while read f; do hadoop fs -cat $f | grep bcd4bc3e1380a56108f486a4fffbc8dc >/dev/null && echo $f; done
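
Since the question mentions editing the matching files by hand, and HDFS files are generally not modified in place, one possible follow-up is to copy a matching file to local disk, edit it there, and put it back. This is only a sketch: the HDFS path is taken from the listing above, and the local filename local_copy is just an example.

# Sketch: pull a matching file out of HDFS, edit it locally, then replace it.
hadoop fs -get /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_aa ./local_copy
# ... edit ./local_copy ...
hadoop fs -rm /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_aa
hadoop fs -put ./local_copy /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_aa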
