Hadoop - composite key
Question
Suppose I have a tab-delimited file containing user activity data formatted like this:
timestamp user_id page_id action_id
I want to write a Hadoop job to count user actions on each page, so the output file should look like this:
user_id page_id number_of_actions
I need something like a composite key here, one that contains both user_id and page_id. Is there a generic way to do this with Hadoop? I couldn't find anything helpful. So far I'm emitting a key like this in my mapper:
context.write(new Text(user_id + " " + page_id), one);
It works, but I feel that it's not the best solution.
Answer
Just write your own Writable. In your example, a solution could look like this:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

import com.google.common.collect.ComparisonChain;

public class UserPageWritable implements WritableComparable<UserPageWritable> {

  private String userId;
  private String pageId;

  // Hadoop instantiates keys via reflection, so a no-arg constructor is required
  public UserPageWritable() {
  }

  public UserPageWritable(String userId, String pageId) {
    this.userId = userId;
    this.pageId = pageId;
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    userId = in.readUTF();
    pageId = in.readUTF();
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(userId);
    out.writeUTF(pageId);
  }

  @Override
  public int compareTo(UserPageWritable o) {
    return ComparisonChain.start().compare(userId, o.userId)
        .compare(pageId, o.pageId).result();
  }
}
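To see what write and readFields are doing, here is a self-contained round-trip sketch using plain java.io streams, with no Hadoop dependency. The class name RoundTrip and the sample IDs are illustrative, not part of the original answer:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RoundTrip {

    // Serialize the two fields exactly as UserPageWritable.write would
    static byte[] serialize(String userId, String pageId) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeUTF(userId);
        out.writeUTF(pageId);
        return bytes.toByteArray();
    }

    // Deserialize in the same field order, as UserPageWritable.readFields would
    static String[] deserialize(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        return new String[] { in.readUTF(), in.readUTF() };
    }

    public static void main(String[] args) throws IOException {
        String[] fields = deserialize(serialize("user42", "page7"));
        System.out.println(fields[0] + " " + fields[1]);  // user42 page7
    }
}
```

The important detail is that readFields must read the fields in exactly the order write wrote them.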
Although I suspect your IDs could be longs, here you have the String version. This is basically just normal serialization over the Writable interface. Note that it needs the default constructor, so you should always provide one.
The compareTo logic defines how the dataset is sorted and also tells the reducer which elements are equal so they can be grouped.
ComparisonChain is a nice utility from Guava.
Don't forget to override equals and hashCode! The partitioner uses the key's hashCode to determine which reducer receives it.
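A sketch of that equals/hashCode pair using java.util.Objects. It is shown here as a standalone class (UserPageKey is an illustrative name); in practice these methods would live inside UserPageWritable itself:

```java
import java.util.Objects;

// Standalone sketch of the equals/hashCode pair for a (userId, pageId) key
public class UserPageKey {

    private final String userId;
    private final String pageId;

    public UserPageKey(String userId, String pageId) {
        this.userId = userId;
        this.pageId = pageId;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof UserPageKey)) return false;
        UserPageKey other = (UserPageKey) obj;
        return Objects.equals(userId, other.userId)
                && Objects.equals(pageId, other.pageId);
    }

    @Override
    public int hashCode() {
        // Both fields participate, so the default HashPartitioner
        // spreads keys across reducers by the (user, page) pair
        return Objects.hash(userId, pageId);
    }

    public static void main(String[] args) {
        UserPageKey a = new UserPageKey("u1", "p1");
        UserPageKey b = new UserPageKey("u1", "p1");
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode());  // true
    }
}
```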