Checksum verification in Hadoop
Problem description
Do we need to verify the checksum after we move files from a Linux server to Hadoop (HDFS) through WebHDFS?
I would like to make sure the files on HDFS are not corrupted after they are copied. But is checking the checksum necessary?
I read that the client computes a checksum before data is written to HDFS.
Can somebody help me understand how I can make sure that the source file on the Linux system is the same as the ingested file on HDFS when using WebHDFS?
Answer
If your goal is to compare two files residing on HDFS, I would not use "hdfs dfs -checksum URI", as in my case it generates different checksums for files with identical content.
In the example below, I compare two files with the same content that reside in different locations:
The old-school md5sum method returns the same checksum:
$ hdfs dfs -cat /project1/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a -
$ hdfs dfs -cat /project2/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a -
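The same cat-and-md5sum approach also answers the original question of verifying a local source file against its ingested HDFS copy. A minimal sketch, using hypothetical paths /data/file.txt (local) and /project1/file.txt (HDFS):

$ md5sum /data/file.txt
$ hdfs dfs -cat /project1/file.txt | md5sum

If the two digests match, the file stored in HDFS is byte-for-byte identical to the local source.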
However, the checksum generated by HDFS differs for files with the same content:
$ hdfs dfs -checksum /project1/file.txt
0000020000000000000000003e50be59553b2ddaf401c575f8df6914
$ hdfs dfs -checksum /project2/file.txt
0000020000000000000000001952d653ccba138f0c4cd4209fbf8e2e
A bit puzzling, as I would expect identical checksums to be generated for identical content.
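The discrepancy is most likely because the default HDFS file checksum is an MD5 of per-block MD5s of CRC32C chunk checksums, so it depends on the block size and bytes-per-checksum settings in effect when each file was written, not only on the file content. On Hadoop 3.1.1 and later, a layout-independent checksum can reportedly be requested via the dfs.checksum.combine.mode property (COMPOSITE_CRC, introduced by HDFS-13056); a sketch, assuming such a cluster:

$ hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum /project1/file.txt
$ hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum /project2/file.txt

With COMPOSITE_CRC, two files with identical content should produce the same checksum regardless of their block layout.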