HDFS File Checksum
Question
I am trying to check the consistency of a file after copying it to HDFS using the Hadoop API - DFSClient.getFileChecksum().
I am getting the following output for the above code:
Null
HDFS : null
Local : null
Can anyone point out the error or mistake? Here is the code:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

public class fileCheckSum {

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem hadoopFS = FileSystem.get(conf);
        // Path hdfsPath = new Path("/derby.log");
        LocalFileSystem localFS = LocalFileSystem.getLocal(conf);
        // Path localPath = new Path("file:///home/ubuntu/derby.log");
        // System.out.println("HDFS PATH : " + hdfsPath.getName());
        // System.out.println("Local PATH : " + localPath.getName());

        FileChecksum hdfsChecksum = hadoopFS.getFileChecksum(new Path("/derby.log"));
        FileChecksum localChecksum = localFS.getFileChecksum(new Path("file:///home/ubuntu/derby.log"));

        if (null != hdfsChecksum || null != localChecksum) {
            System.out.println("HDFS Checksum  : " + hdfsChecksum.toString() + "\t" + hdfsChecksum.getLength());
            System.out.println("Local Checksum : " + localChecksum.toString() + "\t" + localChecksum.getLength());
            if (hdfsChecksum.toString().equals(localChecksum.toString())) {
                System.out.println("Equal");
            } else {
                System.out.println("UnEqual");
            }
        } else {
            System.out.println("Null");
            System.out.println("HDFS : " + hdfsChecksum);
            System.out.println("Local : " + localChecksum);
        }
    }
}
Answer
Since you aren't setting a remote address on the conf and are essentially using the same configuration, both hadoopFS and localFS are pointing to an instance of LocalFileSystem.
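One way to confirm (and fix) this is to point the configuration at the NameNode explicitly before calling FileSystem.get(conf). A minimal sketch, where the hostname and port are placeholders for your cluster, not values from the question (on older Hadoop releases the property is fs.default.name rather than fs.defaultFS):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    // Sketch: force the default filesystem to HDFS so that FileSystem.get(conf)
    // returns a DistributedFileSystem. "namenode.example.com:8020" is a
    // placeholder for your NameNode's address.
    public class HdfsConfCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
            FileSystem fs = FileSystem.get(conf);
            // Should print a DistributedFileSystem class name, not LocalFileSystem.
            System.out.println(fs.getClass().getName());
        }
    }

In practice the same effect is usually achieved by having core-site.xml on the classpath with fs.defaultFS set, rather than hard-coding the address.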
getFileChecksum isn't implemented for LocalFileSystem and returns null. It should work for DistributedFileSystem, though: if your conf points to a distributed cluster, FileSystem.get(conf) returns an instance of DistributedFileSystem, which returns an MD5 of MD5s of CRC32 checksums of chunks of size bytes.per.checksum. This value depends on the block size and the cluster-wide config, bytes.per.checksum. That's why these two parameters are also encoded in the return value of the distributed checksum as the name of the algorithm: MD5-of-xxxMD5-of-yyyCRC32, where xxx is the number of CRC checksums per block and yyy is the bytes.per.checksum parameter.
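To make the naming concrete, here is a small illustration of how those two parameters combine. The numbers are example values (a 128 MB block with the common default of 512 bytes per checksum), not values read from the poster's cluster:

    // Illustration only: how the distributed checksum's algorithm name encodes
    // the block size and bytes.per.checksum. Example values, not Hadoop code.
    public class ChecksumName {
        public static void main(String[] args) {
            long blockSize = 128L * 1024 * 1024; // example dfs block size: 128 MB
            int bytesPerChecksum = 512;          // example bytes.per.checksum

            // One CRC32 is kept per chunk of bytesPerChecksum bytes in a block.
            long crcsPerBlock = blockSize / bytesPerChecksum; // 262144

            // Mirrors the MD5-of-xxxMD5-of-yyyCRC32 naming described above.
            String algorithm = "MD5-of-" + crcsPerBlock
                    + "MD5-of-" + bytesPerChecksum + "CRC32";
            System.out.println(algorithm); // MD5-of-262144MD5-of-512CRC32
        }
    }

So two clusters with different block sizes or bytes.per.checksum settings produce checksums under different algorithm names, which already makes them incomparable.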
getFileChecksum isn't designed to be comparable across filesystems. Although it's possible to simulate the distributed checksum locally, or to hand-craft map-reduce jobs that calculate equivalents of local hashes, I suggest relying on Hadoop's own integrity checks, which happen whenever a file gets written to or read from Hadoop.