Search for a string in 1000 files, each 1GB in size
Problem Description
I am working on SunOS (which is slightly brain-dead). Below is the disk throughput for that Solaris machine:
bash-3.00$ iostat -d 1 10
sd0 sd1 sd2 sd3
kps tps serv kps tps serv kps tps serv kps tps serv
0 0 0 551 16 8 553 16 8 554 16 8
0 0 0 701 11 25 0 0 0 1148 17 33
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
Problem Statement
I have around 1000 files, and each file is 1GB in size. I need to find a string in all these 1000 files, and also determine which files contain that particular string. I am working with the Hadoop File System, and all those 1000 files are in the Hadoop File System.
All the 1000 files are under a real-time folder, so if I run the command below, I get all the 1000 files. I need to find which files contain a particular string.
bash-3.00$ hadoop fs -ls /apps/technology/b_dps/real-time
So for the above problem statement, I am using the command below, which finds all the files that contain the particular string:
hadoop fs -ls /apps/technology/b_dps/real-time | awk '{print $8}' | while read -r f; do hadoop fs -cat "$f" | grep -q cec7051a1380a47a4497a107fecb84c1 && echo "$f"; done
In the above case it finds all the files that contain the string cec7051a1380a47a4497a107fecb84c1. It works fine for me, and I am able to get the names of the files that contain the particular string.
My problem is that the above command is very, very slow. Is there any way to parallelize the above command, or otherwise make it search the files a lot faster?
Any suggestions would be appreciated.
Recommended Answer
You can take a hint from the Grep class. It comes with the Hadoop distribution, in the examples folder:
./bin/hadoop jar hadoop-mapred-examples-0.22.0.jar grep input output regex
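As a lighter-weight alternative to a full MapReduce job, the sequential shell loop from the question can also be parallelized on the client with `xargs -P`. A minimal sketch over local temp files (assumptions: GNU xargs with `-P`; against HDFS you would swap `cat "$1"` for `hadoop fs -cat "$1"` and feed it the `hadoop fs -ls` output as in the question):

```shell
#!/bin/sh
# Demo corpus: two small files, only one contains the needle.
dir=$(mktemp -d)
printf 'hello\nneedle here\n' > "$dir/a.txt"
printf 'hello only\n'         > "$dir/b.txt"

# Search the files with up to 4 greps running in parallel.
# For HDFS, replace `cat "$1"` with `hadoop fs -cat "$1"`.
ls "$dir"/*.txt | xargs -P 4 -I{} sh -c 'cat "$1" | grep -q needle && echo "$1"' sh {}
```

This only prints the path of the file containing the needle; the `-P 4` value is a tuning knob, not a fixed recommendation.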
For the implementation details of this class, see the source under "src/examples/org/apache/hadoop/examples" in the distribution:
Job searchjob = new Job(conf, "search");
FileInputFormat.setInputPaths(searchjob, "input directory in hdfs");
searchjob.setMapperClass(SearchMapper.class);
searchjob.setCombinerClass(LongSumReducer.class);
searchjob.setReducerClass(LongSumReducer.class);
In your SearchMapper class you can do this:
public void map(K key, Text value,
                OutputCollector<Text, LongWritable> output,
                Reporter reporter) throws IOException {
    String text = value.toString();
    Matcher matcher = pattern.matcher(text);
    if (matcher.find()) {
        // Emit the matched text with a count of 1, so the
        // LongSumReducer configured above can total the matches.
        output.collect(new Text(matcher.group()), new LongWritable(1));
    }
}
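The matching logic inside the mapper needs no cluster to be exercised; a standalone sketch of the same `Pattern`/`Matcher` check, using the needle value from the question (the class and sample lines here are illustrative, not part of the Hadoop example):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SearchDemo {
    // Same test the mapper applies to each input line.
    static boolean lineMatches(Pattern pattern, String line) {
        Matcher matcher = pattern.matcher(line);
        return matcher.find();
    }

    public static void main(String[] args) {
        Pattern needle = Pattern.compile("cec7051a1380a47a4497a107fecb84c1");
        String hit  = "log text cec7051a1380a47a4497a107fecb84c1 more text";
        String miss = "no such token on this line";
        System.out.println(lineMatches(needle, hit));   // true
        System.out.println(lineMatches(needle, miss));  // false
    }
}
```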