C# - Hash contents of MS Office documents without metadata

Views: 92 Posted: 2020/5/8 0:37:14 Tags: c# hash ms-office md5

Problem description

I am trying to identify files with duplicate contents, and decided to compare them using a hash (MD5, SHA1, or any other). This works fine for ".txt" files. However, it fails for MS Office files (.doc, .docx, .xls, etc.).
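For reference, the whole-file comparison described above could look roughly like the sketch below. The names `FileHasher` and `HashFile` are illustrative only, and SHA-256 is an arbitrary pick for the "any other" algorithm; MD5 or SHA1 would be wired up the same way.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class FileHasher
{
    // Hashes every byte of the file, so anything stored inside it
    // (including embedded metadata) influences the result.
    public static string HashFile(string path)
    {
        using (SHA256 sha = SHA256.Create())
        using (FileStream stream = File.OpenRead(path))
        {
            byte[] digest = sha.ComputeHash(stream);
            // Render the digest as a lowercase hex string.
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

Two files are treated as duplicates when `FileHasher.HashFile(a) == FileHasher.HashFile(b)`, which only holds if they are byte-for-byte identical.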

The MD5/SHA1 hash is not constant for MS Office files, even when they have the same "text" content. I assume MS Office stores some kind of metadata in the file that changes each time the file is saved, so the hash differs.

For example, I have a file ABC.doc and compute its hash (Hash1). I open it, change one word, and save the file. Then I undo the change, save again, and compute the hash (Hash2). In this case Hash1 != Hash2. If you try the same thing with a ".txt" file, the two hashes are the same.

Is there a way to de-dupe MS Office documents based on hashing their contents? Can we hash only the contents of a file and not its metadata?
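One possible direction, offered as a sketch and not a confirmed solution: a .docx file is a ZIP package whose main body is typically stored in the part `word/document.xml`, so the visible text can be extracted and hashed on its own. The class and method names below (`DocxTextHasher`, `HashVisibleText`) are hypothetical. The sketch assumes .docx only; the legacy binary .doc and .xls formats are not ZIP packages and would need a different reader. It also ignores formatting, paragraph boundaries, headers, footers, and embedded objects.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Xml.Linq;

public static class DocxTextHasher
{
    // WordprocessingML namespace used by the elements in word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static string HashVisibleText(string docxPath)
    {
        using (ZipArchive zip = ZipFile.OpenRead(docxPath))
        {
            // Assumption: the main document part lives at the conventional path.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; is this a .docx file?");

            using (Stream xml = entry.Open())
            {
                XDocument doc = XDocument.Load(xml);

                // Concatenate the text runs (w:t) in document order. Attributes and
                // the other parts of the package are never read, so they cannot
                // affect the hash.
                string text = string.Concat(doc.Descendants(W + "t").Select(t => t.Value));

                using (SHA256 sha = SHA256.Create())
                {
                    byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(text));
                    return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
                }
            }
        }
    }
}
```

With this approach two documents compare equal when their extracted text is identical, even if the files differ byte for byte. Whether that is the right definition of "duplicate" depends on the use case, since documents that differ only in formatting or images would also collide.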
