HDFS File Comparison

Question

How can I compare two HDFS files since there is no diff?

I was thinking of creating Hive tables, loading the data from HDFS, and then running join statements on the two tables. Is there a better approach?

Recommended answer

Hadoop does not ship a diff command, but in a shell that supports process substitution (bash, for example) you can feed both files to the regular diff command:

diff <(hadoop fs -cat /path/to/file) <(hadoop fs -cat /path/to/file2)

If you only want to know whether the two files are identical, without caring what the differences are, I would suggest a checksum-based approach instead: get the checksum of each file and compare them. I think Hadoop doesn't need to generate the checksums from scratch because checksum data is already stored with the blocks, so it should be fast, but I may be wrong. Keep in mind that the default HDFS file checksum depends on the block size and bytes-per-checksum settings, so it is only meaningful for files written with the same settings. Recent Hadoop releases expose the checksum on the command line as hadoop fs -checksum, and you can also do this with the Java API in a small app:

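// imports: org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileChecksum, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FileChecksum chksum1 = fs.getFileChecksum(new Path("/path/to/file"));
FileChecksum chksum2 = fs.getFileChecksum(new Path("/path/to/file2"));
// getFileChecksum may return null on file systems without checksum support;
// equals() compares the checksum values, whereas == only compares object references.
return chksum1 != null && chksum1.equals(chksum2);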
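
If the checksums cannot be used, for instance because getFileChecksum returns null, a streaming byte-by-byte comparison is a possible fallback that still avoids a full diff. The following is only a minimal sketch using plain JDK classes: the names StreamCompare and contentEquals are hypothetical, and with HDFS the two streams would presumably come from fs.open(path) rather than the in-memory arrays used here for illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Hadoop.
final class StreamCompare {

    // Returns true only if both streams yield exactly the same bytes.
    // Stops reading at the first mismatch, so differing files are
    // rejected without being read to the end. Closes both streams.
    static boolean contentEquals(InputStream a, InputStream b) throws IOException {
        try (InputStream in1 = new BufferedInputStream(a);
             InputStream in2 = new BufferedInputStream(b)) {
            int x;
            do {
                x = in1.read();
                int y = in2.read();
                if (x != y) {
                    // Covers both a differing byte and one stream ending early.
                    return false;
                }
            } while (x != -1);
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-ins for the two files being compared.
        byte[] left = "a\nb\n".getBytes();
        byte[] same = "a\nb\n".getBytes();
        byte[] other = "a\nc\n".getBytes();

        System.out.println(contentEquals(
                new ByteArrayInputStream(left), new ByteArrayInputStream(same)));
        System.out.println(contentEquals(
                new ByteArrayInputStream(left), new ByteArrayInputStream(other)));
    }
}
```

Unlike the checksum comparison, this reads the file contents over the network, so it is likely slower for large identical files, but it does not depend on how the files were written.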
