Files not put correctly into distributed cache


Problem description

I am adding a file to the distributed cache using the following code:

Configuration conf2 = new Configuration();      
job = new Job(conf2);
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);

Then I read the file in the mappers:

protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();

    URI[] cacheFile = DistributedCache.getCacheFiles(conf);
    FSDataInputStream in = FileSystem.get(conf).open(new Path(cacheFile[0].getPath()));
    BufferedReader joinReader = new BufferedReader(new InputStreamReader(in));

    String line;
    try {
        while ((line = joinReader.readLine()) != null) {
            String[] s = line.split("\t");
            // do stuff with s
        }
    } finally {
        joinReader.close();
    }
}

The problem is that I only read in one line, and it is not from the file I was putting into the cache. Rather it is: cm9vdA==, or "root" in base64.
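As a quick sanity check, it is easy to confirm what the unexpected value decodes to. A minimal, self-contained sketch (using `java.util.Base64`, which requires Java 8+ and is not part of the original Hadoop 0.20.2-era code):

```java
import java.util.Base64;

public class DecodeExample {
    public static void main(String[] args) {
        // Decode the unexpected value the mapper read instead of the cached file
        String decoded = new String(Base64.getDecoder().decode("cm9vdA=="));
        System.out.println(decoded); // prints "root"
    }
}
```

This suggests the mapper is opening some other entry from the cache (or local filesystem) rather than the intended `part-r-00000` file.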

Has anyone else had this problem, or can anyone see how I'm using the distributed cache incorrectly? I am using Hadoop 0.20.2, fully distributed.

Answer



There is a common mistake in your job configuration:

Configuration conf2 = new Configuration();      
job = new Job(conf2);
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);

After you create your Job object, you need to pull the Configuration object back out of it, because Job makes a copy of it, and configuring values in conf2 after you create the job will have no effect on the job itself. Try this:

job = new Job(new Configuration());
Configuration conf2 = job.getConfiguration();
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);

You should also check the number of files in the distributed cache; there is probably more than one, and you're opening a random file, which is giving you the value you are seeing.
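One defensive way to do that check is to select the cache URI by file name rather than blindly taking index 0. A minimal sketch; `findCacheFile` is a hypothetical helper, not a Hadoop API, and the host/port in the sample URIs are made up:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class CacheFileLookup {
    // Hypothetical helper: return the cache URI whose path ends with the
    // expected file name, or null if no entry matches.
    static URI findCacheFile(URI[] cacheFiles, String name) {
        if (cacheFiles == null) return null;
        for (URI uri : cacheFiles) {
            if (uri.getPath().endsWith(name)) {
                return uri;
            }
        }
        return null;
    }

    public static void main(String[] args) throws URISyntaxException {
        // Simulate getCacheFiles() returning more than one entry
        URI[] cacheFiles = {
            new URI("hdfs://namenode:9000/libs/some-archive.jar"),
            new URI("hdfs://namenode:9000/FilePath/part-r-00000")
        };
        URI match = findCacheFile(cacheFiles, "part-r-00000");
        System.out.println(match.getPath()); // prints "/FilePath/part-r-00000"
    }
}
```

In the mapper's `setup`, you would pass the array returned by `DistributedCache.getCacheFiles(conf)` to such a helper instead of indexing `cacheFile[0]` directly.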

I suggest you use symlinking, which will make the files available in the local working directory under a known name:

DistributedCache.createSymlink(conf2);
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000#myfile"), conf2);

// then in your mapper setup (FileReader, since BufferedReader wraps a Reader):
BufferedReader joinReader = new BufferedReader(new FileReader("myfile"));

