Search for a String in 1000 files and each file size is 1GB


Problem Description

I am working on SunOS (which is slightly brain-dead). Below is the disk throughput for this Solaris machine:

bash-3.00$ iostat -d 1 10
    sd0           sd1           sd2           sd3
kps tps serv  kps tps serv  kps tps serv  kps tps serv
  0   0    0  551  16    8  553  16    8  554  16    8
  0   0    0  701  11   25    0   0    0  1148  17   33
  0   0    0    0   0    0    0   0    0    0   0    0
  0   0    0    0   0    0    0   0    0    0   0    0
  0   0    0    0   0    0    0   0    0    0   0    0
  0   0    0    0   0    0    0   0    0    0   0    0
  0   0    0    0   0    0    0   0    0    0   0    0
  0   0    0    0   0    0    0   0    0    0   0    0
  0   0    0    0   0    0    0   0    0    0   0    0
  0   0    0    0   0    0    0   0    0    0   0    0

Problem Statement

I have around 1000 files, and each file is about 1 GB in size. I need to find a string in all these 1000 files, as well as which files contain that particular string. I am working with the Hadoop File System, and all those 1000 files reside in HDFS.

All the 1000 files are under a real-time folder, so if I run the command below, I will get all the 1000 files. I need to find which of them contain a particular string.

bash-3.00$ hadoop fs -ls /apps/technology/b_dps/real-time

So for the above problem statement, I am using the command below, which will find all the files that contain the particular string:

hadoop fs -ls /apps/technology/b_dps/real-time | awk '{print $8}' | while read f; do hadoop fs -cat "$f" | grep cec7051a1380a47a4497a107fecb84c1 >/dev/null && echo "$f"; done

So in the above case it will find all the files that contain the string cec7051a1380a47a4497a107fecb84c1. It is working fine for me, and I am able to get the file names that contain the particular string.

My question is:

The problem with the above command is that it is very, very slow. Is there any way to parallelize it, or otherwise make it search the files much faster?


Any suggestions will be appreciated.

Recommended Answer

You can get a hint from the Grep example class. It comes with the distribution, in the examples folder:


./bin/hadoop jar hadoop-mapred-examples-0.22.0.jar grep input output regex
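
Run against this question, the invocation might look like the following; the example takes an input directory, an output directory, and a regex, and the output path here is an illustrative choice:

./bin/hadoop jar hadoop-mapred-examples-0.22.0.jar grep /apps/technology/b_dps/real-time /tmp/grep-output cec7051a1380a47a4497a107fecb84c1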

For details on the implementation of this class, you can look at the source in the directory src/examples/org/apache/hadoop/examples that comes with the distribution.

 Job searchjob = new Job(conf, "job name");
 FileInputFormat.setInputPaths(searchjob, "input directory in hdfs");
 searchjob.setMapperClass(SearchMapper.class);
 searchjob.setCombinerClass(LongSumReducer.class);
 searchjob.setReducerClass(LongSumReducer.class);
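
For context, a fuller driver along those lines might look like the sketch below. This is an illustration rather than the original answer's code: the input and output paths, the job name, and the search.pattern configuration key are assumptions, and the new mapreduce API is used throughout.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

public class Search {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hand the search string to every mapper through the job configuration;
    // "search.pattern" is an assumed key, not part of the original answer.
    conf.set("search.pattern", "cec7051a1380a47a4497a107fecb84c1");

    Job searchjob = new Job(conf, "search");
    searchjob.setJarByClass(Search.class);
    FileInputFormat.setInputPaths(searchjob, new Path("/apps/technology/b_dps/real-time"));
    FileOutputFormat.setOutputPath(searchjob, new Path("/tmp/search-output"));

    searchjob.setMapperClass(SearchMapper.class);
    searchjob.setCombinerClass(LongSumReducer.class);
    searchjob.setReducerClass(LongSumReducer.class);
    searchjob.setOutputKeyClass(Text.class);
    searchjob.setOutputValueClass(LongWritable.class);

    System.exit(searchjob.waitForCompletion(true) ? 0 : 1);
  }
}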


In your SearchMapper.class you can do this:

   public void map(K key, Text value,
                   OutputCollector<Text, LongWritable> output,
                   Reporter reporter) throws IOException {
     String text = value.toString();
     Matcher matcher = pattern.matcher(text);
     if (matcher.find()) {
       // Emit the matching line with a count of 1 for the LongSumReducer.
       output.collect(value, new LongWritable(1));
     }
   }
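
Note that the snippet above never shows where pattern is initialized, and it pairs the old mapred-API mapper signature (OutputCollector, Reporter) with the new-API Job used in the driver. A self-consistent sketch of the mapper in the new mapreduce API is given below; it extends the answer's idea by emitting the input file's name, so the job output directly lists which files contain the string. The search.pattern key is the same assumed configuration property as in the driver sketch.

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SearchMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);
  private Pattern pattern;

  @Override
  protected void setup(Context context) {
    // Compile the search string once per task; "search.pattern" is an
    // assumed key set by the driver, not part of the original answer.
    pattern = Pattern.compile(context.getConfiguration().get("search.pattern"));
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (pattern.matcher(value.toString()).find()) {
      // Emit the containing file's path with a count of 1; LongSumReducer
      // then yields one line per matching file with its match count.
      String file = ((FileSplit) context.getInputSplit()).getPath().toString();
      context.write(new Text(file), ONE);
    }
  }
}

The matching file names can then be read from the job's output directory, for example with hadoop fs -cat /tmp/search-output/part-*.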
