How to specify the partitioner for hadoop streaming


Problem description

I have a custom partitioner like below:

import java.util.*;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;

public class SignaturePartitioner extends Partitioner<Text, Text>
{
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks)
    {
        // Partition on the first whitespace-separated token of the key.
        return (key.toString().split(" ")[0].hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

I set the hadoop streaming parameters like below:

 -file SignaturePartitioner.java \
 -partitioner SignaturePartitioner \

Then I get an error: Class Not Found.

Do you know what the problem is?

Best Regards,

Solution

I faced the same issue, but managed to solve it after a lot of research.

The root cause is that hadoop-streaming-2.6.0.jar uses the old mapred API, not the mapreduce API, so you must implement the org.apache.hadoop.mapred.Partitioner interface rather than extend the org.apache.hadoop.mapreduce.Partitioner class. The following worked for me:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class Mypartitioner implements Partitioner<Text, Text> {

    // Nothing to configure; required by the JobConfigurable interface.
    public void configure(JobConf job) {}

    // Keys starting with "a" go to reducer 0, all other keys to reducer 1.
    public int getPartition(Text pkey, Text pvalue, int pnumparts)
    {
        if (pkey.toString().startsWith("a"))
            return 0;
        else
            return 1;
    }
}
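
To sanity-check the routing logic before submitting a job, you can call getPartition directly from a tiny driver. This test class is my own sketch, not part of the original answer:

import org.apache.hadoop.io.Text;

public class MypartitionerTest {
    public static void main(String[] args) {
        Mypartitioner p = new Mypartitioner();
        // Keys starting with "a" land in partition 0, everything else in 1.
        System.out.println(p.getPartition(new Text("apple"), new Text("v"), 2)); // prints 0
        System.out.println(p.getPartition(new Text("beta"), new Text("v"), 2));  // prints 1
    }
}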

Compile Mypartitioner and package it into a jar, then submit the streaming job shown below.
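
A minimal sketch of the compile-and-package step, run from the Hadoop install directory (the classpath invocation is standard, but the exact file names are just the ones used above):

javac -cp "$(bin/hadoop classpath)" Mypartitioner.java
jar cf Mypartitioner.jar Mypartitioner.class

With the jar built, the job submission: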

bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
-libjars /home/sanjiv/hadoop-2.6.0/Mypartitioner.jar \
-D mapreduce.job.reduces=2 \
-files /home/sanjiv/mymapper.sh,/home/sanjiv/myreducer.sh \
-input indir -output outdir -mapper mymapper.sh \
-reducer myreducer.sh -partitioner Mypartitioner
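
With mapreduce.job.reduces=2, outdir will contain one part file per reducer, and keys starting with "a" should all land in the first one. A quick illustrative check (my own addition, not from the original answer):

bin/hadoop fs -cat outdir/part-00000 | head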
