How to sort comma-separated keys in Reducer output?


Problem description

I am running an RFM analysis program using MapReduce. The OutputKeyClass is Text.class, and the Reducer emits comma-separated R (Recency), F (Frequency), and M (Monetary) values as the key, where R is a BigInteger, F is a BigInteger, and M is a BigDecimal. The value is also a Text, representing the Customer_ID. I know that Hadoop sorts output by key, but my final result is a bit weird. I want the output keys sorted by R first, then F, then M, but for reasons I can't explain I get the following order:

545,1,7652    100000
545,23,390159.402343750    100001
452,13,132586    100002
452,4,32202    100004
452,1,9310    100007
452,1,4057    100018
452,3,18970    100021

But I want the following output:

545,23,390159.402343750    100001
545,1,7652    100000
452,13,132586    100002
452,4,32202    100004
452,3,18970    100021
452,1,9310    100007
452,1,4057    100018

NOTE: The Customer_ID was the key in the Map phase, and all the RFM values belonging to a particular Customer_ID are brought together at the Reducer for aggregation.
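A likely explanation for the observed order, offered as a reading of the note above rather than something the question confirms: Hadoop sorts on the map output key during the shuffle, and the Reducer writes records in the order its input groups arrive, so the output follows Customer_ID rather than the R,F,M key. The plain-Java sketch below has no Hadoop dependency; the class `ShuffleOrderDemo` and its `TreeMap` stand-in for the shuffle are illustrative assumptions only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShuffleOrderDemo {
    // Returns the output lines in the order a reducer would emit them if its
    // input groups arrive sorted by the map output key (Customer_ID).
    static List<String> reducerOutputOrder() {
        // TreeMap stands in for the shuffle: it keeps the map output keys sorted.
        Map<String, String> rfmByCustomer = new TreeMap<>();
        rfmByCustomer.put("100021", "452,3,18970");
        rfmByCustomer.put("100000", "545,1,7652");
        rfmByCustomer.put("100018", "452,1,4057");
        rfmByCustomer.put("100001", "545,23,390159.402343750");
        rfmByCustomer.put("100007", "452,1,9310");
        rfmByCustomer.put("100002", "452,13,132586");
        rfmByCustomer.put("100004", "452,4,32202");

        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> entry : rfmByCustomer.entrySet()) {
            // The reducer swaps the roles: RFM becomes the key, Customer_ID the value.
            lines.add(entry.getValue() + "\t" + entry.getKey());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : reducerOutputOrder()) {
            System.out.println(line);
        }
    }
}
```

The lines come out in the same order as the unwanted output above, which suggests the R,F,M values have to be part of a key that goes through the shuffle before Hadoop can sort on them.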

Recommended answer

After a lot of searching I found some useful material, which I have compiled here:

  1. Start with a custom data type. Since I had three comma-separated values that needed to be sorted in descending order, I had to create a TextQuadlet.java data type in Hadoop. It is a quadlet because the first part of the key is the natural key and the remaining three parts are R, F, and M:

import java.io.*;
import org.apache.hadoop.io.*;
public class TextQuadlet implements WritableComparable<TextQuadlet> {
private String customer_id;
private long R;
private long F;
private double M;
public TextQuadlet() {
}
public TextQuadlet(String customer_id, long R, long F, double M) {
    set(customer_id, R, F, M);
}
public void set(String customer_id2, long R2, long F2, double M2) {
    this.customer_id = customer_id2;
    this.R = R2;
    this.F = F2;
    this.M=M2;
}
public String getCustomer_id() {
    return customer_id;
}
public long getR() {
    return R;
}
public long getF() {
    return F;
}
public double getM() {
    return M;
}
@Override
public void write(DataOutput out) throws IOException {
    out.writeUTF(this.customer_id);
    out.writeLong(this.R);
    out.writeLong(this.F);
    out.writeDouble(this.M);
}
@Override
public void readFields(DataInput in) throws IOException {
    this.customer_id = in.readUTF();
    this.R = in.readLong();
    this.F = in.readLong();
    this.M = in.readDouble();
}
// This hashcode function is important as it is used by the custom
// partitioner for this class.
@Override
public int hashCode() {
    return (int) (customer_id.hashCode() * 163 + R + F + M);
}
@Override
public boolean equals(Object o) {
    if (o instanceof TextQuadlet) {
        TextQuadlet tp = (TextQuadlet) o;
        return customer_id.equals(tp.customer_id) && R == (tp.R) && F==(tp.F) && M==(tp.M);
    }
    return false;
}
@Override
public String toString() {
    return customer_id + "," + R + "," + F + "," + M;
}
// compareTo returns a negative value when this key should sort before tp,
// a positive value when it should sort after tp, and 0 when the two tie.
// Passing tp's field first to Long.compare/Double.compare reverses the
// natural order, which gives a descending sort.
@Override
public int compareTo(TextQuadlet tp) {
    // The natural key, customer_id, is deliberately not taken into
    // consideration here: keys are ordered by R, then F, then M.
    int cmp = Long.compare(tp.R, this.R);
    if (cmp != 0) {
        return cmp;
    }
    cmp = Long.compare(tp.F, this.F);
    if (cmp != 0) {
        return cmp;
    }
    return Double.compare(tp.M, this.M);
}
public static int compare(TextQuadlet tp1, TextQuadlet tp2) {
    int cmp = tp1.compareTo(tp2);
    return cmp;
}
public static int compare(Text customer_id1, Text customer_id2) {
    // Compare the first id against the second (not against itself).
    int cmp = customer_id1.compareTo(customer_id2);
    return cmp;
}
}
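The `write` and `readFields` methods only work if they handle the fields in the same order. As a minimal sketch of that contract, the following uses only `java.io` and a hypothetical stand-in class, `QuadletRoundTrip`, that mirrors the field layout of TextQuadlet without implementing Hadoop's interfaces:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class QuadletRoundTrip {
    final String customerId;
    final long r;
    final long f;
    final double m;

    QuadletRoundTrip(String customerId, long r, long f, double m) {
        this.customerId = customerId;
        this.r = r;
        this.f = f;
        this.m = m;
    }

    // Writes the fields in the same order as TextQuadlet.write.
    void write(DataOutputStream out) throws IOException {
        out.writeUTF(customerId);
        out.writeLong(r);
        out.writeLong(f);
        out.writeDouble(m);
    }

    // Reads the fields back in exactly the order they were written.
    static QuadletRoundTrip read(DataInputStream in) throws IOException {
        String id = in.readUTF();
        long r = in.readLong();
        long f = in.readLong();
        double m = in.readDouble();
        return new QuadletRoundTrip(id, r, f, m);
    }

    // Serializes a key to bytes and deserializes it again.
    static QuadletRoundTrip roundTrip(QuadletRoundTrip key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                key.write(out);
            }
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return read(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        QuadletRoundTrip copy = roundTrip(new QuadletRoundTrip("100001", 545L, 23L, 390159.40234375));
        System.out.println(copy.customerId + "," + copy.r + "," + copy.f + "," + copy.m);
    }
}
```

If the read order differed from the write order, the fields would come back scrambled or the read would fail, which is why the two methods in TextQuadlet mirror each other line for line.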

  2. Next, you'll need a custom partitioner so that all the values with the same key end up at the same reducer:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;
    
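    public class FirstPartitioner_RFM extends Partitioner<TextQuadlet, Text> {
        @Override
        public int getPartition(TextQuadlet key, Text value, int numPartitions) {
            // Clear the sign bit so that a negative hash code cannot
            // produce a negative (invalid) partition number.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }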
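One detail worth checking in any hash-based partitioner: a hash code can be negative, and Java's `%` operator keeps the sign of the dividend, so `hashCode() % numPartitions` can yield a negative partition number. Hadoop's own HashPartitioner avoids this by masking with `Integer.MAX_VALUE`. The sketch below shows the difference in plain Java; the class `PartitionDemo` and its method names are illustrative only.

```java
class PartitionDemo {
    // Naive version: the remainder keeps the sign of the hash, so it can be negative.
    static int naivePartition(int hash, int numPartitions) {
        return hash % numPartitions;
    }

    // Clearing the sign bit first keeps the result in the range [0, numPartitions).
    static int safePartition(int hash, int numPartitions) {
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int hash = -7; // hashCode() is free to return negative values
        System.out.println(naivePartition(hash, 4)); // prints -3
        System.out.println(safePartition(hash, 4));  // prints 1
    }
}
```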
    

  3. Thirdly, you'll need a custom group comparator so that all the values are grouped by their natural key, customer_id, rather than by the composite key customer_id,R,F,M:

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    
    public class GroupComparator_RFM_N extends WritableComparator {
        protected GroupComparator_RFM_N() {
            super(TextQuadlet.class, true);
        }

        @SuppressWarnings("rawtypes")
        @Override
        public int compare(WritableComparable w1, WritableComparable w2) {
            TextQuadlet ip1 = (TextQuadlet) w1;
            TextQuadlet ip2 = (TextQuadlet) w2;
            // Here we tell Hadoop to group the keys by their natural key.
            return ip1.getCustomer_id().compareTo(ip2.getCustomer_id());
        }
    }
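To see what the grouping comparator does, it helps to know that the reduce side walks the keys in sorted order and starts a new group whenever the grouping comparator reports a difference from the previous key. The sketch below imitates that in plain Java. The class `GroupingDemo`, the `String[]` key layout, and the ids `c1` and `c2` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GroupingDemo {
    // Splits keys that are already in sort order into reduce groups: a new
    // group starts whenever the natural key (element 0) differs from the
    // natural key of the previous key.
    static List<List<String>> group(List<String[]> sortedKeys) {
        List<List<String>> groups = new ArrayList<>();
        String previousId = null;
        for (String[] key : sortedKeys) {
            String id = key[0];
            if (previousId == null || !id.equals(previousId)) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(String.join(",", key));
            previousId = id;
        }
        return groups;
    }

    // Keys laid out as {customer_id, R, F, M}, already sorted by R descending.
    static List<List<String>> sampleGroups() {
        List<String[]> sortedKeys = Arrays.asList(
            new String[] {"c1", "5", "1", "1.0"},
            new String[] {"c1", "4", "1", "1.0"},
            new String[] {"c2", "3", "1", "1.0"},
            new String[] {"c1", "2", "1", "1.0"});
        return group(sortedKeys);
    }

    public static void main(String[] args) {
        for (List<String> g : sampleGroups()) {
            System.out.println(g);
        }
    }
}
```

In this sample the last `c1` key lands in a group of its own because a `c2` key sorts between it and the other `c1` keys. When the sort order ignores customer_id, keys for the same customer share a group only if they happen to be adjacent, so this setup is safest when each customer contributes a single key.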
    

  4. Fourthly, you'll need a key comparator, which again sorts the keys by R, F, M in descending order and implements the same ordering used in TextQuadlet.java. Since I got lost while coding, I slightly changed the way I compare the fields in this function, but the underlying logic is the same as in TextQuadlet.java:

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    
    public class KeyComparator_RFM extends WritableComparator {
        protected KeyComparator_RFM() {
            super(TextQuadlet.class, true);
        }

        @SuppressWarnings("rawtypes")
        @Override
        public int compare(WritableComparable w1, WritableComparable w2) {
            TextQuadlet ip1 = (TextQuadlet) w1;
            TextQuadlet ip2 = (TextQuadlet) w2;
            // A negative return value places w1 before w2 in the sorted
            // output, a positive value places it after w2, and 0 treats the
            // two keys as equal. Returning 1 when ip1 holds the smaller
            // value therefore sorts in descending order.
            if (ip1.getR() == ip2.getR()) {
                if (ip1.getF() == ip2.getF()) {
                    if (ip1.getM() == ip2.getM()) {
                        return 0;
                    } else {
                        if (ip1.getM() < ip2.getM())
                            return 1;
                        else
                            return -1;
                    }
                } else {
                    if (ip1.getF() < ip2.getF())
                        return 1;
                    else
                        return -1;
                }
            } else {
                if (ip1.getR() < ip2.getR())
                    return 1;
                else
                    return -1;
            }
        }
    }
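As a quick check of the ordering, the following plain-Java sketch applies the same R, then F, then M descending comparison to the seven sample records from the question. It has no Hadoop dependency; the class `RfmSortDemo` and its nested `Key` type are hypothetical stand-ins for TextQuadlet.

```java
import java.util.ArrayList;
import java.util.List;

class RfmSortDemo {
    static final class Key {
        final String customerId;
        final long r;
        final long f;
        final double m;

        Key(String customerId, long r, long f, double m) {
            this.customerId = customerId;
            this.r = r;
            this.f = f;
            this.m = m;
        }
    }

    // Descending by R, then F, then M: b's field is passed first so that
    // larger values sort earlier.
    static int compareDesc(Key a, Key b) {
        int cmp = Long.compare(b.r, a.r);
        if (cmp != 0) {
            return cmp;
        }
        cmp = Long.compare(b.f, a.f);
        if (cmp != 0) {
            return cmp;
        }
        return Double.compare(b.m, a.m);
    }

    // Sorts the sample records from the question and returns the customer ids in order.
    static List<String> sortedCustomerIds() {
        List<Key> keys = new ArrayList<>();
        keys.add(new Key("100000", 545, 1, 7652));
        keys.add(new Key("100001", 545, 23, 390159.402343750));
        keys.add(new Key("100002", 452, 13, 132586));
        keys.add(new Key("100004", 452, 4, 32202));
        keys.add(new Key("100007", 452, 1, 9310));
        keys.add(new Key("100018", 452, 1, 4057));
        keys.add(new Key("100021", 452, 3, 18970));
        keys.sort(RfmSortDemo::compareDesc);

        List<String> ids = new ArrayList<>();
        for (Key k : keys) {
            ids.add(k.customerId);
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", sortedCustomerIds()));
    }
}
```

The resulting customer order, 100001, 100000, 100002, 100004, 100021, 100007, 100018, matches the desired output listed in the question.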
    

  5. Finally, in your driver class, you'll have to register the custom classes. Here I have used TextQuadlet, Text as the k-v pair, but you can choose any other classes depending on your needs:

    job.setPartitionerClass(FirstPartitioner_RFM.class);
    job.setSortComparatorClass(KeyComparator_RFM.class);
    job.setGroupingComparatorClass(GroupComparator_RFM_N.class);
    job.setMapOutputKeyClass(TextQuadlet.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(TextQuadlet.class);
    job.setOutputValueClass(Text.class);
    

Please correct me if I have gone wrong technically anywhere in the code or the explanation. I have based this answer purely on my own understanding of what I read on the internet, and it works perfectly for me.
