Hadoop - composite key


Problem description



Suppose I have a tab-delimited file containing user activity data, formatted like this:

timestamp  user_id  page_id  action_id

I want to write a Hadoop job to count user actions on each page, so the output file should look like this:

user_id  page_id  number_of_actions

I need something like a composite key here: it would contain user_id and page_id. Is there any generic way to do this with Hadoop? I couldn't find anything helpful. So far I'm emitting a key like this in the mapper:

context.write(new Text(user_id + "\t" + page_id), one);

It works, but I feel that it's not the best solution.

Solution

Just compose your own Writable. In your example, a solution could look like this:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

import com.google.common.collect.ComparisonChain;

public class UserPageWritable implements WritableComparable<UserPageWritable> {

  private String userId;
  private String pageId;

  // Hadoop instantiates Writables via reflection, so the no-argument
  // constructor is required.
  public UserPageWritable() {
  }

  // Convenience constructor for building keys in the mapper.
  public UserPageWritable(String userId, String pageId) {
    this.userId = userId;
    this.pageId = pageId;
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // Deserialize the fields in the same order write() emits them.
    userId = in.readUTF();
    pageId = in.readUTF();
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(userId);
    out.writeUTF(pageId);
  }

  @Override
  public int compareTo(UserPageWritable o) {
    // Sort by userId first, then by pageId.
    return ComparisonChain.start().compare(userId, o.userId)
        .compare(pageId, o.pageId).result();
  }

}

Although I think your IDs could be long values, here you have the String version. Basically this is just normal serialization over the Writable interface. Note that Hadoop creates Writables by reflection, so the class needs a default (no-argument) constructor; you should always provide one.
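If the IDs are in fact numeric, only the field types and the serialization calls change. A minimal sketch of the long variant (an assumption that the input IDs parse as longs; only the changed members of UserPageWritable are shown):

  private long userId;
  private long pageId;

  @Override
  public void readFields(DataInput in) throws IOException {
    userId = in.readLong();
    pageId = in.readLong();
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(userId);
    out.writeLong(pageId);
  }

  @Override
  public int compareTo(UserPageWritable o) {
    // ComparisonChain has primitive overloads, so this works for longs too.
    return ComparisonChain.start().compare(userId, o.userId)
        .compare(pageId, o.pageId).result();
  }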

The compareTo logic obviously tells Hadoop how to sort the dataset, and it also tells the reducer which elements are equal so that they can be grouped.

ComparisonChain is a nice utility from Guava.
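To show where this key ends up in practice, here is a minimal sketch of a mapper and reducer built around it. The class names UserActionMapper and UserActionReducer are made up for illustration, and the mapper relies on the convenience constructor shown above:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits ((user_id, page_id), 1) for every input line.
class UserActionMapper
    extends Mapper<LongWritable, Text, UserPageWritable, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Input line: timestamp \t user_id \t page_id \t action_id
    String[] fields = line.toString().split("\t");
    context.write(new UserPageWritable(fields[1], fields[2]), ONE);
  }
}

// Sums the ones for each distinct (user_id, page_id) key.
class UserActionReducer
    extends Reducer<UserPageWritable, IntWritable, UserPageWritable, IntWritable> {

  @Override
  protected void reduce(UserPageWritable key, Iterable<IntWritable> counts,
      Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

With the default TextOutputFormat, giving UserPageWritable a tab-separated toString() would yield the user_id page_id number_of_actions lines the question asks for.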

Don't forget to override equals and hashCode! The partitioner will determine the reducer by the hashCode of the key.
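For completeness, one way the two methods could look on UserPageWritable (a sketch, assuming java.util.Objects is imported; the original answer leaves them to the reader):

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (!(obj instanceof UserPageWritable)) {
      return false;
    }
    UserPageWritable other = (UserPageWritable) obj;
    return Objects.equals(userId, other.userId)
        && Objects.equals(pageId, other.pageId);
  }

  @Override
  public int hashCode() {
    // The default HashPartitioner routes keys by this value, so equal
    // keys must hash identically or grouping across reducers breaks.
    return Objects.hash(userId, pageId);
  }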
