Grep across multiple files in Hadoop Filesystem


Problem Description

I am working with Hadoop and I need to find which of ~100 files in my Hadoop filesystem contain a certain string.

I can see the files I wish to search like this:

bash-3.00$ hadoop fs -ls /apps/hdmi-technology/b_dps/real-time

..which returns several entries like this:

-rw-r--r--   3 b_dps mdhi-technology 1073741824 2012-07-18 22:50 /apps/hdmi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_aa
-rw-r--r--   3 b_dps mdhi-technology 1073741824 2012-07-18 22:50 /apps/hdmi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_ab

How do I find which of these contains the string bcd4bc3e1380a56108f486a4fffbc8dc? Once I know, I can edit them manually.

Answer

This is a Hadoop "filesystem", not a POSIX one, so try this (the awk step pulls out field 8 of each ls line, which is the full file path):

hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' |
while read -r f
do
  # grep -q exits 0 on the first match without printing; echo the matching path instead
  hadoop fs -cat "$f" | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo "$f"
done
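
As a quick sanity check, you can run the same test against a single path taken from the listing above; grep -c additionally reports how many lines match (the file name here is just the first entry from the question):

hadoop fs -cat /apps/hdmi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_aa | grep -c bcd4bc3e1380a56108f486a4fffbc8dc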

The loop above should work, but it is serial and so may be slow. If your cluster can take the heat, we can parallelize:

hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' |
  xargs -n 1 -I ^ -P 10 bash -c \
  "hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^"

Notice the -P 10 option to xargs above: this is how many files we will download and search in parallel. Start low and increase the number until you saturate disk I/O or network bandwidth, whichever is relevant in your configuration.
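
One rough way to pick that number is to time the whole pipeline at a few settings and stop where the wall-clock gain levels off. A minimal sketch, assuming bash and the same path as above:

for p in 2 5 10 20
do
  echo "trying -P $p"
  time (hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' |
    xargs -n 1 -I ^ -P "$p" bash -c \
    "hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^")
done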

EDIT: Given that you're on SunOS (which is slightly brain-dead), try this one-liner; it swaps grep -q for a redirect to /dev/null, since an older SunOS grep may not support -q:

hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | while read f; do hadoop fs -cat "$f" | grep bcd4bc3e1380a56108f486a4fffbc8dc >/dev/null && echo "$f"; done

