Grep across multiple files in Hadoop Filesystem


Problem description

I am working with Hadoop and I need to find which of ~100 files in my Hadoop filesystem contain a certain string.

I can see the files I wish to search like this:

bash-3.00$ hadoop fs -ls /apps/mdhi-technology/b_dps/real-time

...which returns several entries like this:

-rw-r--r--   3 b_dps mdhi-technology 1073741824 2012-07-18 22:50 /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_aa
-rw-r--r--   3 b_dps mdhi-technology 1073741824 2012-07-18 22:50 /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_ab

How do I find which of these contains the string bcd4bc3e1380a56108f486a4fffbc8dc? Once I know, I can edit them manually.

Recommended answer

This is a Hadoop "filesystem", not a POSIX one, so try this:

hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | \
while read f
do
  hadoop fs -cat $f | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo $f
done

This should work, but it is serial and so may be slow: awk '{print $8}' picks the path (the eighth field) out of the hadoop fs -ls output, and each file is then streamed through grep one at a time. If your cluster can take the heat, we can parallelize:

hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | \
  xargs -n 1 -I ^ -P 10 bash -c \
  "hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^"

Notice the -P 10 option to xargs: this is how many files we will download and search in parallel. Start low and increase the number until you saturate disk I/O or network bandwidth, whichever is relevant in your configuration.
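
A rough way to choose the -P value is to time the whole pipeline at a few settings and keep the largest value that still improves wall-clock time. The loop below is only a sketch; the candidate values 2, 5, 10 and 20 are arbitrary:

# Sketch: time the parallel search at a few -P values and compare wall-clock times.
for p in 2 5 10 20; do
  echo "trying -P $p"
  time ( hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | \
    xargs -n 1 -I ^ -P "$p" bash -c \
    "hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^" )
done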

EDIT: Given that you're on SunOS (which is slightly brain-dead), try this; SunOS's stock grep may not support the -q flag, so grep's output is redirected to /dev/null instead:

hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | while read f; do hadoop fs -cat $f | grep bcd4bc3e1380a56108f486a4fffbc8dc >/dev/null && echo $f; done
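
Since the question mentions editing the matching files by hand, and HDFS files are generally not modified in place, one possible follow-up is to copy a matching file to local disk, edit it there, and put it back. This is only a sketch: the HDFS path is taken from the listing above, and the local filename local_copy is just an example.

# Sketch: pull a matching file out of HDFS, edit it locally, then replace it.
hadoop fs -get /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_aa ./local_copy
# ... edit ./local_copy ...
hadoop fs -rm /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_aa
hadoop fs -put ./local_copy /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_aa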
