Hadoop: can you use a pair of values as "Key"?
Problem Description
I am trying to analyze a large crime-statistics data set; the file is about 2 GB in CSV format. There are about 20 columns, but I am interested in only a subset of them: Crime_Type and Crime_in_Year. For example, the crime type "burglary" occurs every year from 2001 through 2013, and I want a result that counts the occurrences of burglary in each year.
So I am thinking of having a pair such as (burglary, 2003) as the key, and the value would be the total number of occurrences in that year. Is it possible to have a pair of values as the key in Hadoop/MapReduce?
Recommended Answer
A key can be anything so long as it implements Writable. Strictly, a key must implement WritableComparable, since MapReduce sorts map output by key; plain values only need Writable. You could write your own custom key pretty easily, as shown here.
So, borrowing from the documentation, one implementation might be:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Composite (year, type) key. hashCode/equals matter too: the default
// HashPartitioner uses hashCode to route equal keys to the same reducer.
public class CrimeWritable implements WritableComparable<CrimeWritable> {
    private int year;
    private String type;

    public CrimeWritable() {}                 // required for deserialization
    public CrimeWritable(int year, String type) { this.year = year; this.type = type; }

    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeUTF(type);                   // length-prefixed, so it can be read back
    }

    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
        type = in.readUTF();                  // DataInput has no readBytes(); readUTF pairs with writeUTF
    }

    public int compareTo(CrimeWritable o) {   // keys must sort
        int c = Integer.compare(year, o.year);
        return c != 0 ? c : type.compareTo(o.type);
    }

    @Override public int hashCode() { return 31 * year + type.hashCode(); }
    @Override public boolean equals(Object o) {
        return o instanceof CrimeWritable && ((CrimeWritable) o).year == year
                && ((CrimeWritable) o).type.equals(type);
    }
}
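
To show how such a key slots into a job, here is a minimal sketch of a mapper and reducer counting occurrences per (type, year) pair; the class names and CSV column positions are assumptions for illustration, not something given in the original answer:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class CrimeMapper extends Mapper<LongWritable, Text, CrimeWritable, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // Naive comma split with hypothetical column positions (type in 5, year in 17);
        // header and malformed rows would need handling in a real job.
        String[] cols = line.toString().split(",");
        ctx.write(new CrimeWritable(Integer.parseInt(cols[17]), cols[5]), ONE);
    }
}

class CrimeReducer extends Reducer<CrimeWritable, IntWritable, CrimeWritable, IntWritable> {
    @Override
    protected void reduce(CrimeWritable key, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();  // one 1 per matching CSV row
        ctx.write(key, new IntWritable(sum));         // e.g. (2003, burglary) -> total
    }
}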
On a related note, you might want to consider using a higher-level abstraction than MapReduce, such as Cascading or Apache Spark.
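
For instance, a minimal sketch of the same count in Spark's Java DataFrame API, assuming the CSV has a header row containing the Crime_Type and Crime_in_Year columns (the file path is a placeholder):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CrimeCounts {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("CrimeCounts").getOrCreate();
        Dataset<Row> crimes = spark.read().option("header", "true").csv("crimes.csv");
        // Grouping by the two columns plays the role of the composite key.
        crimes.groupBy("Crime_Type", "Crime_in_Year").count().show();
        spark.stop();
    }
}

Here the pair-of-columns groupBy makes the composite-key question disappear entirely, which is much of the appeal of the higher-level APIs.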