How do I get the last modified date from a Hadoop Sequence File?
Question
I am using a mapper that converts binary files (JPEGs) to a Hadoop Sequence File (HSF):
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    String uri = value.toString().replace(" ", "%20");
    Configuration conf = new Configuration();
    FSDataInputStream in = null;
    try {
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        in = fs.open(new Path(uri));
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024 * 1024];
        int bytesRead;
        // Write only the bytes actually read; writing the whole buffer
        // would pad the image with stale data after a short read.
        while ((bytesRead = in.read(buffer, 0, buffer.length)) >= 0) {
            bout.write(buffer, 0, bytesRead);
        }
        context.write(value, new BytesWritable(bout.toByteArray()));
    } finally {
        // Close the stream even if reading fails.
        if (in != null) {
            in.close();
        }
    }
}
I then have a second mapper that reads the HSF, thus:
public class ImagePHashMapper extends Mapper<Text, BytesWritable, Text, Text> {
    public void map(Text key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // Get the PHash for this specific file.
        String PHashStr;
        try {
            // copyBytes() returns exactly getLength() bytes; getBytes() may
            // return a longer, padded backing array.
            PHashStr = calculatePhash(value.copyBytes());
and calculatePhash is:
static String calculatePhash(byte[] imageData) throws NoSuchAlgorithmException {
    // Get the PHash for this specific data.
    // PHash requires an InputStream rather than a byte array.
    InputStream is = new ByteArrayInputStream(imageData);
    String ph;
    try {
        ImagePHash ih = new ImagePHash();
        ph = ih.getHash(is);
        System.out.println("file: " + is.toString() + " phash: " + ph);
    } catch (Exception e) {
        e.printStackTrace();
        return "Internal error with ImagePHash.getHash";
    }
    return ph;
}
This all works fine, but I want calculatePhash to write out each JPEG's last modified date. I know I can use file.lastModified() to get a file's last modified date, but is there any way to get it in either map or calculatePhash? I'm a noob at Java. TIA!
Recommended answer
Hi, I think what you want is the modification time of each input file that enters your mapper. If that is the case, you just have to add a few lines to mpkorstanje's solution:
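FileSystem fs = FileSystem.get(URI.create(uri), conf);
// FileStatus exposes getModificationTime() (milliseconds since the epoch);
// it has no lastModified() method.
long modificationTime = fs
        .getFileStatus(((FileSplit) context.getInputSplit()).getPath())
        .getModificationTime();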
With these few changes you can get the FileStatus of each input split and add the modification time to your key for later use in your process, or use MultipleOutputs in the reducer and write it somewhere else during the reduce phase.
I hope this is useful.
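As a rough illustration of "adding it to your key", here is a minimal sketch that packs a modification time into the key string and recovers it later. Note that the input split refers to the file the mapper is reading, so for a per-JPEG timestamp the value would more likely come from something like fs.getFileStatus(new Path(uri)).getModificationTime() in the first mapper, before the image is written to the sequence file. The class name TimestampedKey, its method names, and the tab separator are all assumptions made for this example; the sketch also assumes the URI itself never contains a tab.

```java
import java.time.Instant;

public class TimestampedKey {
    // Hypothetical separator; assumed never to occur inside the URI.
    private static final char SEP = '\t';

    /** Builds "uri<TAB>millis" so the timestamp travels with the key. */
    public static String pack(String uri, long modificationTime) {
        return uri + SEP + modificationTime;
    }

    /** Returns the URI part of a packed key. */
    public static String uriOf(String key) {
        return key.substring(0, key.lastIndexOf(SEP));
    }

    /** Returns the modification time (ms since the epoch) of a packed key. */
    public static long modificationTimeOf(String key) {
        return Long.parseLong(key.substring(key.lastIndexOf(SEP) + 1));
    }

    public static void main(String[] args) {
        // In the first mapper the second argument would be the value read
        // from FileStatus; a fixed number stands in for it here.
        String key = pack("hdfs://namenode/images/cat%20one.jpg", 1400000000000L);
        System.out.println(uriOf(key));
        System.out.println(Instant.ofEpochMilli(modificationTimeOf(key)));
    }
}
```

In the first mapper the packed string would be wrapped in a Text and used as the output key; the second mapper could then call uriOf and modificationTimeOf on key.toString() before or after computing the hash.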