Files not put correctly into distributed cache
Question
I am adding a file to the distributed cache using the following code:
Configuration conf2 = new Configuration();
job = new Job(conf2);
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
Then I read the file in the mapper's setup:
protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    URI[] cacheFile = DistributedCache.getCacheFiles(conf);
    FSDataInputStream in = FileSystem.get(conf).open(new Path(cacheFile[0].getPath()));
    BufferedReader joinReader = new BufferedReader(new InputStreamReader(in));
    String line;
    try {
        while ((line = joinReader.readLine()) != null) {
            String[] s = line.split("\t");
            // do stuff to s
        }
    } finally {
        joinReader.close();
    }
}
The problem is that I only read in one line, and it is not from the file I put into the cache. Instead I get: cm9vdA==, which is "root" in base64.
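The claim that cm9vdA== decodes to "root" is easy to verify. A minimal check, using java.util.Base64 (newer than the Hadoop 0.20.2 era of this question, but convenient for a quick sanity test):

```java
import java.util.Base64;

public class DecodeCheck {
    public static void main(String[] args) {
        // Decode the mystery value that came back from the stream
        byte[] bytes = Base64.getDecoder().decode("cm9vdA==");
        System.out.println(new String(bytes)); // prints: root
    }
}
```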
Has anyone else had this problem, or can you see how I'm using the distributed cache incorrectly? I am using Hadoop 0.20.2, fully distributed.
Answer
Common mistake in your job configuration:
Configuration conf2 = new Configuration();
job = new Job(conf2);
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
After you create your Job object, you need to pull the Configuration object back out of it, because Job makes a copy of it; configuring values on conf2 after creating the job will have no effect on the job itself. Try this:
job = new Job(new Configuration());
Configuration conf2 = job.getConfiguration();
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
You should also check the number of files in the distributed cache; there is probably more than one, and you're opening a random file that is giving you the value you are seeing.
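A quick way to confirm what actually ended up in the cache is to log every URI returned by DistributedCache.getCacheFiles in the mapper's setup. This is only a sketch against the Hadoop 0.20.2 API already used above (it assumes the same conf object from the question's setup method):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

// Inside setup(): list everything registered in the distributed cache,
// so you can see which file cacheFile[0] actually points at.
URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
if (cacheFiles != null) {
    for (URI uri : cacheFiles) {
        System.err.println("Cache file: " + uri);
    }
}
```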
I suggest you use symlinking, which will make the file available in the local working directory under a known name:
DistributedCache.createSymlink(conf2);
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000#myfile"), conf2);
// then in your mapper setup:
BufferedReader joinReader = new BufferedReader(new FileReader("myfile"));