Programmatical approach in Java for file comparison


Question

What would be the best approach to compare two hexadecimal file signatures against each other for similarities?

More specifically, I would like to take the hexadecimal representation of an .exe file and compare it against a series of virus signatures. For this approach I plan to break the hex representation of the file (exe) into individual groups of N chars (e.g. 10 hex chars) and do the same with the virus signature. I am aiming to perform some sort of heuristic and thereby statistically check whether this exe file has X% similarity to the known virus signature.
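The chunking idea described above can be sketched roughly as follows. This is only an illustration, not the asker's code: the class `HexChunkSimilarity` and its methods are hypothetical names, and it uses overlapping windows of N hex characters stored in a hash set (a deliberate variation on fixed, non-overlapping groups) so that a signature shifted by a few characters can still match.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not taken from the original question.
final class HexChunkSimilarity {

    // Converts raw bytes into a lowercase hex string, two characters per byte.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Collects every overlapping window of n characters from the hex string.
    static Set<String> chunks(String hex, int n) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + n <= hex.length(); i++) {
            result.add(hex.substring(i, i + n));
        }
        return result;
    }

    // Percentage (0..100) of the signature's distinct chunks that also occur in the file.
    static double percentMatched(String fileHex, String signatureHex, int n) {
        Set<String> sigChunks = chunks(signatureHex, n);
        if (sigChunks.isEmpty()) {
            return 0.0; // signature shorter than n: nothing to compare
        }
        Set<String> fileChunks = chunks(fileHex, n);
        int hits = 0;
        for (String c : sigChunks) {
            if (fileChunks.contains(c)) {
                hits++;
            }
        }
        return 100.0 * hits / sigChunks.size();
    }
}
```

Because each lookup in the set takes constant expected time, the cost grows with the combined length of the file and the signature rather than with the number of chunk pairs. The result could then be compared against a chosen X% threshold.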

The simplest and likely very wrong way I thought of doing this is to compare exe[n, n-1] against virus[n, n-1], where each element in the array is a sub-array, and therefore exe1[0,9] against virus1[0,9]. Each subset will be graded statistically.

As you can imagine, there would be a massive number of comparisons, and hence it would be very slow. So I thought I would ask whether you can think of a better approach to such a comparison, for example by combining different data structures.

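This is for a project I am doing for my bachelor's degree, in which I am trying to develop an algorithm to detect polymorphic malware. This is only one part of the whole system; the other part is based on genetic algorithms that evolve the static virus signature. Any advice, comments, or general information such as resources are very welcome.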

Definition: Polymorphic malware (virus, worm, ...) maintains the same functionality and payload as its "original" version while having apparently different structures (variants). It achieves this through code obfuscation, which alters its hex signature. Some of the techniques used for polymorphism are: format alteration (inserting or removing blanks), variable renaming, statement rearrangement, junk code addition, statement replacement (x=1 changes to x=y/5 where y=5), and swapping of control statements. Much like the flu virus mutates so that vaccination is not effective, polymorphic malware mutates to avoid detection.

Update: After the advice you gave me regarding what to read, I did the reading, but it left me somewhat more confused. I found several distance algorithms that could apply to my problem, such as:

  • Longest common subsequence
  • Levenshtein algorithm
  • EMBOSS package
  • Smith-Waterman algorithm
  • Boyer-Moore algorithm
  • Aho-Corasick algorithm

But now I don't know which to use; they all seem to do the same thing in different ways. I will continue researching so that I can understand each one better, but in the meantime could you give me your opinion on which might be more suitable, so that I can give it priority during my research and study it in more depth?

Update 2: I ended up using an amalgamation of the LCSubsequence, LCSubstring and Levenshtein Distance. Thank you all for the suggestions.
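For readers unfamiliar with these measures, here is a minimal sketch of two of them, Levenshtein distance and longest common substring, using standard dynamic programming. It is not the asker's implementation; the class name `SequenceDistances` is hypothetical, and how the individual scores are combined into one verdict is left open.

```java
// Hypothetical illustration; not the implementation referred to in the update.
final class SequenceDistances {

    // Levenshtein distance: minimum number of single-character insertions,
    // deletions and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Length of the longest contiguous run of characters shared by a and b.
    static int longestCommonSubstring(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                // Extend the run ending at (i-1, j-1), or reset it on a mismatch.
                curr[j] = a.charAt(i - 1) == b.charAt(j - 1) ? prev[j - 1] + 1 : 0;
                best = Math.max(best, curr[j]);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return best;
    }
}
```

Both routines take time proportional to the product of the two input lengths, so they are better suited to comparing a signature against candidate regions than against a whole executable at once.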

There is also a copy of the finished paper on GitHub.

Recommended answer

For algorithms like these, I suggest you look into the field of bioinformatics. The problem setting there is similar: you have large files (genome sequences) in which you are looking for certain signatures (genes, special well-known short base sequences, etc.).

Bioinformatics should also offer you a lot when it comes to polymorphic malware, because in biology it seems similarly difficult to get exact matches. (Unfortunately, I am not aware of appropriate approximate searching/matching algorithms to point you to.)

One example from this direction would be to adapt something like the Aho-Corasick algorithm to search for several malware signatures at the same time.
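As a rough illustration of that idea, the following is a compact, exact-match Aho-Corasick sketch over hex strings. The class name `AhoCorasickSketch` is hypothetical, patterns are assumed to be non-empty, and no adaptation for approximate matching is attempted here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical illustration of multi-pattern search; assumes non-empty patterns.
final class AhoCorasickSketch {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;
        final List<Integer> matches = new ArrayList<>(); // indexes of patterns ending here
    }

    private final Node root = new Node();

    AhoCorasickSketch(List<String> patterns) {
        // Phase 1: insert every pattern into a trie.
        for (int p = 0; p < patterns.size(); p++) {
            Node node = root;
            for (char c : patterns.get(p).toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.matches.add(p);
        }
        // Phase 2: breadth-first pass that sets the failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                // Inherit patterns that end at the longest proper suffix.
                child.matches.addAll(child.fail.matches);
                queue.add(child);
            }
        }
    }

    // Maps each pattern index to the end offsets (exclusive) of its occurrences in text.
    Map<Integer, List<Integer>> search(String text) {
        Map<Integer, List<Integer>> hits = new HashMap<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            Node nextNode = node.next.get(c);
            node = (nextNode != null) ? nextNode : root;
            for (int p : node.matches) {
                hits.computeIfAbsent(p, k -> new ArrayList<>()).add(i + 1);
            }
        }
        return hits;
    }
}
```

The text is scanned once regardless of how many signatures are loaded, which is the property that makes this structure attractive when the signature set is large.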

Similarly, algorithms like the Boyer-Moore algorithm give you fantastic search runtimes, especially for longer sequences (average case of O(N/M) for a text of size N in which you look for a pattern of size M, i.e. sublinear search times).
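To show where the sublinear behaviour comes from, here is a sketch of Boyer-Moore-Horspool, a simplified variant that keeps only the bad-character shift table and omits the good-suffix rule of full Boyer-Moore. It works directly on raw bytes; the class name `HorspoolSearch` is hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration: Boyer-Moore-Horspool, a simplified Boyer-Moore variant.
final class HorspoolSearch {

    // Returns the index of the first occurrence of pattern in text, or -1 if absent.
    static int indexOf(byte[] text, byte[] pattern) {
        int m = pattern.length;
        int n = text.length;
        if (m == 0) {
            return 0;
        }
        if (m > n) {
            return -1;
        }

        // shift[b] = how far the window may move when its last byte is b.
        int[] shift = new int[256];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[pattern[i] & 0xff] = m - 1 - i;
        }

        int pos = 0;
        while (pos <= n - m) {
            // Compare the window against the pattern from right to left.
            int j = m - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) {
                j--;
            }
            if (j < 0) {
                return pos;
            }
            // Skip ahead based on the byte under the window's last position.
            pos += shift[text[pos + m - 1] & 0xff];
        }
        return -1;
    }
}
```

When the byte under the end of the window does not occur in the pattern, the window jumps a full M positions, which is why longer patterns tend to be found faster.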
