HDFS File Comparison
Problem description
How can I compare two HDFS files, since there is no diff command?
I was thinking of loading the data from HDFS into Hive tables and then using join statements on the 2 tables. Is there a better approach?
Hadoop does not provide a diff command, but you can use process substitution in your shell (bash or zsh, for example) to feed both files to the regular diff command:
diff <(hadoop fs -cat /path/to/file) <(hadoop fs -cat /path/to/file2)
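If you only want to know whether the 2 files are identical, without caring what the differences are, I would suggest a checksum-based approach: get the checksum of each file and compare them. HDFS already stores checksums for the data it writes, so the file checksum is derived from that stored metadata rather than by rereading the data, and it should be fast. Note that the default HDFS file checksum depends on the block size and bytes-per-checksum settings, so the comparison is only meaningful when both files were written with the same settings. Recent Hadoop releases expose this on the command line as hadoop fs -checksum <path>, and you can also do it easily with the Java API in a small app: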
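// FileSystem, FileChecksum and Path are in org.apache.hadoop.fs; conf is your Hadoop Configuration.
FileSystem fs = FileSystem.get(conf);
FileChecksum chksum1 = fs.getFileChecksum(new Path("/path/to/file"));
FileChecksum chksum2 = fs.getFileChecksum(new Path("/path/to/file2"));
// Compare by value with equals(); == only compares object references.
// getFileChecksum may return null on file systems without checksum support.
return chksum1 != null && chksum1.equals(chksum2);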
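When the stored checksums cannot be compared directly, for example because the two files were written with different block sizes, one fallback is to stream both files and hash the contents on the client. The following is only a minimal sketch of that idea using plain java.io streams and no Hadoop classes. StreamDigestCompare and its methods are hypothetical names, and SHA-256 is an arbitrary choice. With HDFS, the two streams would presumably come from FileSystem.open(path), at the cost of reading both files in full.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not part of Hadoop.
final class StreamDigestCompare {

    // Reads the stream to the end and returns the SHA-256 digest of its bytes.
    static byte[] digest(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        return md.digest();
    }

    // True when both streams yield the same digest, i.e. the same content
    // (up to the negligible chance of a hash collision).
    static boolean sameContent(InputStream a, InputStream b)
            throws IOException, NoSuchAlgorithmException {
        return MessageDigest.isEqual(digest(a), digest(b));
    }

    public static void main(String[] args) throws Exception {
        // In-memory stand-ins for the two files; with HDFS these would be
        // streams opened from the file system instead (assumption).
        byte[] x = "hello\n".getBytes(StandardCharsets.UTF_8);
        byte[] y = "hello\n".getBytes(StandardCharsets.UTF_8);
        byte[] z = "world\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(sameContent(new ByteArrayInputStream(x), new ByteArrayInputStream(y)));
        System.out.println(sameContent(new ByteArrayInputStream(x), new ByteArrayInputStream(z)));
    }
}
```

Unlike the stored HDFS checksum, this digest depends only on the bytes of each file, so it gives the same answer however the files were written. The trade-off is that every byte has to travel to the client.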