Scan duplicate documents with MD5

Question

For some reason I can't use MessageDigest.getInstance("MD5"), so I have to write the algorithm by hand. My project scans for duplicate documents (*.doc, *.txt, *.pdf) on an Android device. My question: what must I write, before the algorithm itself, to scan for duplicate documents starting from the root directory of the device? The user does not select a directory; when I press the scan button, the process begins and the ListView shows the results. Can anyone help me? My project deadline is coming up. Thank you so much.

public class MD5 {

// What must I write here so that I can scan the Android root directory for duplicate documents using MD5 hashes?

//MD5 MANUAL ALGORITHM CODE
}


Recommended answer

Overall process:

Your goal is to detect (and perhaps store information about) duplicate files.

1. First, iterate through the directories and files.

See: "List all files from directories and subdirectories in Java"

2. For each file, load its contents into a byte array.

See: "Reading a binary input stream into a single byte array in Java"

3. Then compute the MD5 of those bytes (this is the part your project implements by hand).

4. Finally, store this information.

You can use a Set to detect duplicates (a Set holds only unique elements).

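// uses java.util.Set, HashSet, Map, HashMap
Set<String> files_hash = new HashSet<>(); // each String is the string representation of an MD5
if (files_hash.contains(my_md5)) { /* you know you have it already: duplicate */ }

Map<String, String> file_and_hash = new HashMap<>(); // each entry is file => hash
// with only the Map you have to iterate over its values to know if you have a hash already, so keep a Set as well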
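The four steps above can be sketched together roughly as follows. This is a minimal illustration, not the asker's code: the class name `DuplicateScanner`, the `Hasher` interface (a hook where the hand-written MD5 would be plugged in), and all method names are hypothetical. It assumes plain `java.io.File` access; on Android the start directory and the storage permissions depend on the app's setup. It groups files by hash in a Map of lists, so each duplicate group can be shown in a list view.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class DuplicateScanner {
    /** Hypothetical hook for the hand-written MD5: file bytes in, hex string out. */
    interface Hasher {
        String hash(byte[] content);
    }

    private static final String[] EXTENSIONS = {".doc", ".txt", ".pdf"};

    /** Step 1: recursively collect every document below root. */
    static List<File> findDocuments(File root) {
        List<File> found = new ArrayList<>();
        collect(root, found);
        return found;
    }

    private static void collect(File dir, List<File> found) {
        File[] children = dir.listFiles(); // null if dir is unreadable or not a directory
        if (children == null) {
            return;
        }
        for (File child : children) {
            if (child.isDirectory()) {
                collect(child, found);
            } else if (isDocument(child.getName())) {
                found.add(child);
            }
        }
    }

    static boolean isDocument(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        for (String extension : EXTENSIONS) {
            if (lower.endsWith(extension)) {
                return true;
            }
        }
        return false;
    }

    /** Step 2: read a whole stream into one byte array. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        return out.toByteArray();
    }

    /** Steps 3 and 4: hash every file and group the files by hash. */
    static Map<String, List<File>> groupByHash(List<File> files, Hasher hasher) throws IOException {
        Map<String, List<File>> byHash = new LinkedHashMap<>();
        for (File file : files) {
            byte[] content;
            try (InputStream in = new FileInputStream(file)) {
                content = readAll(in);
            }
            String hash = hasher.hash(content);
            List<File> sameHash = byHash.get(hash);
            if (sameHash == null) {
                sameHash = new ArrayList<>();
                byHash.put(hash, sameHash);
            }
            sameHash.add(file);
        }
        return byHash;
    }
}
```

Every entry of the returned map whose list holds more than one file is a group of duplicates. Loading each file fully into memory keeps the sketch short; for large PDFs, feeding the hash in chunks may be preferable.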

For the MD5 part:

Read the algorithm:
https://en.wikipedia.org/wiki/MD5

RFC: https://www.ietf.org/rfc/rfc1321.txt

With some googling, this presentation goes through it step by step:
http://infohost.nmt.edu/~sfs/Students/HarleyKozushko/Presentations/MD5.pdf

Or try porting an existing C (or Java) implementation.
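As a rough idea of what such a hand-written port might look like, here is a compact sketch that follows the pseudocode on the Wikipedia page and RFC 1321. The class name `ManualMd5` and its method names are made up for this example, and the constant table is derived with `Math.sin` rather than typed out; treat it as a starting point to check against the RFC test vectors, not as a hardened implementation.

```java
import java.nio.charset.StandardCharsets;

final class ManualMd5 {
    // Left-rotation amount for each of the 64 steps (RFC 1321).
    private static final int[] S = {
        7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22,
        5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20,
        4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23,
        6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21
    };

    // K[i] = floor(2^32 * |sin(i + 1)|), the additive constant for step i.
    private static final int[] K = new int[64];
    static {
        for (int i = 0; i < 64; i++) {
            K[i] = (int) (long) (Math.abs(Math.sin(i + 1)) * 4294967296.0);
        }
    }

    /** Returns the 16-byte MD5 digest of message. */
    static byte[] digest(byte[] message) {
        // Pad: 0x80, then zeros, then the bit length as 8 little-endian bytes.
        int numBlocks = ((message.length + 8) >>> 6) + 1;
        byte[] padded = new byte[numBlocks * 64];
        System.arraycopy(message, 0, padded, 0, message.length);
        padded[message.length] = (byte) 0x80;
        long bitLength = (long) message.length * 8;
        for (int i = 0; i < 8; i++) {
            padded[padded.length - 8 + i] = (byte) (bitLength >>> (8 * i));
        }

        int a0 = 0x67452301, b0 = 0xefcdab89, c0 = 0x98badcfe, d0 = 0x10325476;
        int[] m = new int[16];
        for (int block = 0; block < numBlocks; block++) {
            // Split the 64-byte block into sixteen little-endian 32-bit words.
            for (int j = 0; j < 16; j++) {
                int off = block * 64 + j * 4;
                m[j] = (padded[off] & 0xff)
                        | (padded[off + 1] & 0xff) << 8
                        | (padded[off + 2] & 0xff) << 16
                        | (padded[off + 3] & 0xff) << 24;
            }
            int a = a0, b = b0, c = c0, d = d0;
            for (int i = 0; i < 64; i++) {
                int f, g;
                if (i < 16) {
                    f = (b & c) | (~b & d);
                    g = i;
                } else if (i < 32) {
                    f = (d & b) | (~d & c);
                    g = (5 * i + 1) % 16;
                } else if (i < 48) {
                    f = b ^ c ^ d;
                    g = (3 * i + 5) % 16;
                } else {
                    f = c ^ (b | ~d);
                    g = (7 * i) % 16;
                }
                int rotated = Integer.rotateLeft(a + f + K[i] + m[g], S[i]);
                a = d;
                d = c;
                c = b;
                b = b + rotated;
            }
            a0 += a;
            b0 += b;
            c0 += c;
            d0 += d;
        }

        // Output the four state words in little-endian byte order.
        int[] state = {a0, b0, c0, d0};
        byte[] out = new byte[16];
        for (int i = 0; i < 16; i++) {
            out[i] = (byte) (state[i / 4] >>> (8 * (i % 4)));
        }
        return out;
    }

    /** Returns the digest as a 32-character lowercase hex string. */
    static String hex(byte[] message) {
        StringBuilder sb = new StringBuilder(32);
        for (byte x : digest(message)) {
            sb.append(String.format("%02x", x & 0xff));
        }
        return sb.toString();
    }

    static String hex(String text) {
        return hex(text.getBytes(StandardCharsets.UTF_8));
    }
}
```

The test suite in RFC 1321 gives reference values to compare against; for example, the empty string should hash to d41d8cd98f00b204e9800998ecf8427e and "abc" to 900150983cd24fb0d6963f7d28e17f72.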

Overall strategy

To save time and make the process faster, you must also think about how your function will be used:

  • If you use it once, for one particular file, it is better to reduce the work by first selecting the other files by size (only files of the same size can be duplicates).

  • If you use it regularly (and want it to be fast), scan for new files regularly in the background to keep a hash database up to date. Detecting new files is straightforward.

  • If you want to find all duplicated files, it is better to scan everything, and to use the Set strategy as well.
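The size-based selection mentioned in the first point can be illustrated with a small sketch. The class `SizePrefilter` and its method are hypothetical; the sizes are assumed to have been collected beforehand, for example with `File.length()`. Only files that share their size with at least one other file are kept as candidates for hashing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SizePrefilter {
    /** Returns only the paths whose size is shared with at least one other path. */
    static List<String> hashCandidates(Map<String, Long> sizeByPath) {
        // Group the paths by file size.
        Map<Long, List<String>> pathsBySize = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : sizeByPath.entrySet()) {
            List<String> group = pathsBySize.get(entry.getValue());
            if (group == null) {
                group = new ArrayList<>();
                pathsBySize.put(entry.getValue(), group);
            }
            group.add(entry.getKey());
        }
        // A file with a unique size cannot have a duplicate, so skip it.
        List<String> candidates = new ArrayList<>();
        for (List<String> group : pathsBySize.values()) {
            if (group.size() > 1) {
                candidates.addAll(group);
            }
        }
        return candidates;
    }
}
```

Comparing sizes costs far less than reading and hashing file contents, so this step can cut the number of MD5 computations considerably when most files have distinct sizes.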

Hope this helps.
