Programmatical approach in Java for file comparison

Views: 221  Published: 2015/11/30 16:06:34  Tags: java algorithm data-structures distance file-comparison


Problem description

What would be the best approach to compare two hexadecimal file signatures against each other for similarities?

More specifically, I would like to take the hexadecimal representation of an .exe file and compare it against a series of virus signatures. For this approach I plan to break the hex representation of the file (exe) into individual groups of N chars (i.e. 10 hex chars) and do the same with the virus signature. I am aiming to apply some sort of heuristic and thereby statistically check whether this exe file has X% similarity to the known virus signature.
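The grouping step described above could be sketched roughly as follows. This is only an illustration: the class name `HexChunks` and the methods `toHex` and `chunk` are made up for this example and do not come from the original post.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (names are assumptions, not from the original post).
final class HexChunks {
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    /** Converts raw bytes to a lowercase hex string, two characters per byte. */
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            // High nibble first, then low nibble.
            sb.append(HEX[(b >> 4) & 0xF]).append(HEX[b & 0xF]);
        }
        return sb.toString();
    }

    /** Splits hex into consecutive groups of n characters; the last group may be shorter. */
    static List<String> chunk(String hex, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> groups = new ArrayList<>();
        for (int i = 0; i < hex.length(); i += n) {
            groups.add(hex.substring(i, Math.min(hex.length(), i + n)));
        }
        return groups;
    }
}
```

The bytes of a file could be obtained with `java.nio.file.Files.readAllBytes(Paths.get("sample.exe"))`, where `sample.exe` is a placeholder path, and then passed through `toHex` and `chunk` with N = 10.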

The simplest, and likely very wrong, way I thought of doing this is to compare exe[n, n-1] against virus[n, n-1], where each element in the array is a sub-array, so that exe1[0,9] is compared against virus1[0,9]. Each subset would be graded statistically.

As you can imagine, there would be a massive number of comparisons, and hence this would be very, very slow. So I thought I would ask whether you can think of a better approach to such a comparison, for example by combining different data structures.
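One possible way to avoid comparing every group against every other group is to index the file's groups in a hash set, so that each signature group costs a single lookup. The sketch below is an assumption about how that might look, not something the original post proposes: `NGramOverlap`, `nGrams`, and `containment` are hypothetical names, and the sliding window of one hex character is an arbitrary choice (a step of two would keep the groups aligned on byte boundaries).

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (names are assumptions, not from the original post).
final class NGramOverlap {
    /** Collects every distinct substring of length n, sliding by one character. */
    static Set<String> nGrams(String hex, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= hex.length(); i++) {
            grams.add(hex.substring(i, i + n));
        }
        return grams;
    }

    /** Fraction (0.0 to 1.0) of the signature's distinct n-grams that also occur in the file. */
    static double containment(String fileHex, String signatureHex, int n) {
        Set<String> signatureGrams = nGrams(signatureHex, n);
        if (signatureGrams.isEmpty()) {
            return 0.0; // signature shorter than n: nothing to compare
        }
        Set<String> fileGrams = nGrams(fileHex, n);
        int hits = 0;
        for (String gram : signatureGrams) {
            if (fileGrams.contains(gram)) { // average O(1) lookup instead of a scan
                hits++;
            }
        }
        return (double) hits / signatureGrams.size();
    }
}
```

Multiplying the result by 100 gives an "X% of the signature was found" figure. The file's set only needs to be built once if it is reused across many signatures.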

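This is a project I am doing for my bachelor's degree, in which I am trying to develop an algorithm to detect polymorphic malware. This is only one part of the whole system; the other part is based on genetic algorithms that evolve the static virus signatures. Any suggestions, comments, or general information such as resources are very welcome.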

Definition: Polymorphic malware (virus, worm, ...) maintains the same functionality and payload as its "original" version while having apparently different structures (variants). It achieves that through code obfuscation, which alters its hex signature. Some of the techniques used for polymorphism are: format alteration (inserting or removing blanks), variable renaming, statement rearrangement, junk code addition, statement replacement (x=1 changes to x=y/5 where y=5), and swapping of control statements. So, much like the flu virus mutates and thereby makes vaccination ineffective, polymorphic malware mutates to avoid detection.

Update: After the advice you gave me regarding what to read, I did the reading, but it left me somewhat more confused. I found several distance algorithms that could apply to my problem, such as:

Longest common subsequence
Levenshtein algorithm
EMBOSS package
Smith-Waterman algorithm
Boyer-Moore algorithm
Aho-Corasick algorithm
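As a point of reference for the list above: Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another, whereas Boyer-Moore and Aho-Corasick search for exact occurrences of patterns. A minimal sketch of the Levenshtein calculation, using the standard two-row dynamic-programming table, might look like this (the class name `Levenshtein` is chosen here for illustration):

```java
// Illustrative sketch; class and method names are assumptions.
final class Levenshtein {
    /** Minimum number of single-character insertions, deletions, and substitutions. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // match or substitution
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

The running time is proportional to the product of the two string lengths, so for whole executables it would probably have to be applied to short groups rather than to the full hex dump.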

But now I don't know which one to use; they all seem to do the same thing in different ways. I will continue my research so that I can understand each one better, but in the meantime could you give me your opinion on which might be more suitable, so that I can give it priority and study it in more depth?

Update 2: I ended up using an amalgamation of the LCSubsequence, LCSubstring and Levenshtein Distance. Thank you all for the suggestions.

There is also a copy of the finished paper on GitHub.
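The post does not say how the three measures from Update 2 were combined. Purely as an illustration, the sketch below normalises each measure by the longer string's length and takes an unweighted average; the equal weighting, the class name `CombinedSimilarity`, and all method names are assumptions made for this example, not the author's method.

```java
// Illustrative sketch only: the equal-weight average is an assumption.
final class CombinedSimilarity {
    /** Length of the longest common subsequence (characters need not be adjacent). */
    static int lcsSubsequence(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                curr[j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? prev[j - 1] + 1
                        : Math.max(prev[j], curr[j - 1]);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Length of the longest common substring (characters must be contiguous). */
    static int lcsSubstring(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                curr[j] = a.charAt(i - 1) == b.charAt(j - 1) ? prev[j - 1] + 1 : 0;
                best = Math.max(best, curr[j]);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return best;
    }

    /** Levenshtein edit distance (insertions, deletions, substitutions). */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Unweighted mean of three normalised similarities, in the range 0.0 to 1.0. */
    static double score(String a, String b) {
        int max = Math.max(a.length(), b.length());
        if (max == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        double subsequence = (double) lcsSubsequence(a, b) / max;
        double substring = (double) lcsSubstring(a, b) / max;
        double edit = 1.0 - (double) levenshtein(a, b) / max;
        return (subsequence + substring + edit) / 3.0;
    }
}
```

A call such as `CombinedSimilarity.score(exeGroup, signatureGroup)` (both argument names are placeholders) returns a value between 0.0 and 1.0 that could be compared against a chosen threshold. Different weights for the three terms would be equally defensible.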
