Calculating Binary Data Similarity

Problem Description

I've seen a few questions here related to determining the similarity of files, but they are all linked to a particular domain (images, sounds, text, etc). The techniques offered as solutions require knowledge of the underlying file format of the files being compared. What I am looking for is a method without this requirement, where arbitrary binary files could be compared without needing to understand what type of data they contain. That is, I am looking to determine the similarity percentage of two files' binary data.

To give a little more detail for you to work with, even though this is potentially applicable to many things, I do have a specific problem that I'm working on. I also currently have a working solution, but I don't think that it is ideal. There are probably many optimizations in terms of the comparison method, and storing the results. Hopefully some people here will be able to give me some new ideas. I will probably edit in some information about my current method after a couple of days, but I don't want to bias peoples' thoughts about the problem by telling you how I'm already doing it.

The problem I'm working on is clone detection for video game ROM images. For those that don't have experience with emulation, ROMs are dumps of the data on game cartridges. A ROM "clone" is typically a modified version of the same game, the most common type being a translated version. For example, the Japanese and English versions of the original Final Fantasy for the NES are clones. The games share almost all of their assets (sprites, music, etc), but the text has been translated.

There are currently several groups that work on maintaining lists of clones for the various systems, but as far as I can tell, this is all done manually. What I am attempting to do is find a method to detect similar ROM images automatically and objectively, based on data similarity instead of "these seem like the same game". There are several reasons for detecting clones, but one of the major motivations is to be used with Solid compression. This allows compression of all game clones together into the same archive, with the entire compressed clone set often taking up only slightly more space than one of the individual ROMs.

Some concerns to consider when coming up with potential approaches:

  • ROMs vary highly in size, depending on the system. Some are small, but modern systems may have large ones, 256MB or more. Some (all?) systems only allow powers of 2 as possible sizes; a 130MB game on one of these systems would ship on a 256MB ROM that is largely empty. Note that because of this, some clones may have wildly different sizes, if a game version crosses the threshold and has to use a cartridge that is twice the size.
  • There are currently thousands of known ROMs on many systems, with most systems still having new ones released constantly. Even for older systems, there is a major ROM-hacking community that produces modified ROMs often.
  • Storing similarity data for every possible pair of ROMs would result in millions of rows of data for any of the more popular systems. A system with 5000 ROMs would require 25 million rows of similarity data, with a single new game adding another 5000 rows.
  • State of the processing must be recoverable, so that if it is interrupted it can pick up where it left off. With any method, a lot of processing will be required, and assuming that the whole thing will run in one batch is not safe.
  • New ROMs could be added at any time, so the method should not assume that it already has a "complete" set. That is, even after you have already figured out similarity for all existing ROMs, if a new one is added (and this could also occur before previous processing was entirely finished) there must be a method for comparing it to all previous ones, to determine which (if any) it is a clone of.
  • Higher processing speed should be given priority over accuracy (to a point). Knowing whether two ROMs are 94% or 96% similar is not particularly important, but if it takes a day of processing to compare a new ROM to all the previous ones, the program would probably never truly complete.

It's been an interesting problem to work on, I look forward to seeing what other people can come up with. Let me know in the comments if you want any more details, and I'll try to supply them.

Recommended Answer

It sounds like you want a binary delta, or perhaps an index derived from the application of a binary delta (like its size). You could then compare this index to some baseline that you determine experimentally to decide whether it's a "clone" or not.

There are a lot of similarities between compression and delta creation, so I'd say you aren't far off with your current implementation.

That being said, pairwise comparison of every binary file in your database is probably prohibitively expensive (O(n²), I think). I would try to find a simple hash for identifying possible candidates for comparison, something conceptually similar to what spdenne and Eduard are suggesting. That is, find a hash which can be applied to every item once, sort that list, and then use a finer-grained comparison on items whose hashes are close together in the list.
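As an illustration of the sort-then-compare idea (not the poster's actual method), here is a minimal sketch. The fingerprint, evenly spaced sample bytes, is a deliberately naive placeholder for a real similarity hash:

```python
def fingerprint(data: bytes, samples: int = 16) -> tuple:
    """Naive sortable fingerprint: bytes sampled at evenly spaced offsets.
    A real system would use a proper similarity hash here."""
    if not data:
        return (0,) * samples
    step = max(1, len(data) // samples)
    return tuple(data[min(i * step, len(data) - 1)] for i in range(samples))

def candidate_pairs(roms, window=1):
    """Sort ROM names by fingerprint; only neighbours within `window`
    positions in the sorted order become candidates for a full,
    expensive pairwise comparison."""
    order = sorted(roms, key=lambda name: fingerprint(roms[name]))
    for i, name in enumerate(order):
        for j in range(i + 1, min(i + 1 + window, len(order))):
            yield name, order[j]
```

This turns O(n²) full comparisons into one sort plus O(n·window) fine-grained comparisons, and a newly added ROM only needs to be fingerprinted once and inserted into the sorted order.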

Constructing hashes useful for the general case has been an actively pursued research topic in CS for several years. The LSHKit software library implements some algorithms of this sort. The internet-accessible paper Finding Similar Files in a Large File System seems like it might be targeted more at comparing text files, but it might be useful to you. The more recent paper Multi-resolution similarity hashing describes a more powerful algorithm; it doesn't appear to be accessible without a subscription, though. You will probably want to keep the Wikipedia article on Locality-Sensitive Hashing handy as you browse the other resources. They all get pretty technical, and the Wikipedia entry itself is pretty math-heavy. As a more user-friendly alternative, you might be able to apply some ideas (or even executables) from the field of Acoustic Fingerprinting.
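To make the locality-sensitive idea a little less abstract, here is a toy MinHash over byte n-grams. This is not from any of the papers above: the XOR-mixing "permutations" are a cheap stand-in for true random permutations, chosen only to keep the sketch short:

```python
import hashlib

GOLDEN = 0x9E3779B97F4A7C15
MASK = (1 << 64) - 1

def minhash(data: bytes, ngram: int = 8, perms: int = 32) -> list:
    """Toy MinHash signature over non-overlapping byte n-grams. Files that
    share most of their n-grams agree in most signature positions."""
    sig = [float("inf")] * perms
    for i in range(0, max(1, len(data) - ngram + 1), ngram):
        h = int.from_bytes(
            hashlib.blake2b(data[i:i + ngram], digest_size=8).digest(), "big")
        for p in range(perms):
            # XOR with a per-slot constant: a weak stand-in for a permutation
            hp = h ^ (GOLDEN * (p + 1) & MASK)
            if hp < sig[p]:
                sig[p] = hp
    return sig

def sig_similarity(a: list, b: list) -> float:
    """Fraction of matching slots, a rough Jaccard similarity estimate."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

The payoff is that each ROM is reduced to a small fixed-size signature once, after which comparing signatures is cheap regardless of ROM size.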

If you're willing to abandon the general case, it's likely that you can find a much simpler (and faster) domain-specific hash function that works just for your ROMs, possibly something involving the placement of standard or common byte sequences and the values of select bits near them. I don't really know much about your binary format, but I'm imagining things that signal the start of sections in the file, like regions for sound, images, or text. Binary formats frequently store the addresses of these sorts of sections near the beginning of the file. Some also use a chaining mechanism that stores the address of the first section, along with its size, at a known location. This allows you to move to the next section, which also contains a size, and so on. A little investigation will probably allow you to discover any relevant formatting, if you're not already aware of it, and should put you well on your way to constructing a useful hash.
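The chaining idea above can be sketched as follows. The 8-byte header layout here is entirely made up, purely to illustrate hopping from section to section to collect material for a hash; it is not taken from any real console's format:

```python
import struct

def walk_sections(rom: bytes, first_offset: int = 16):
    """Walk a *hypothetical* chained section layout: each section header
    holds a 4-byte little-endian size followed by a 4-byte offset of the
    next header (0 = end of chain)."""
    sections = []
    off = first_offset
    while 0 < off <= len(rom) - 8:
        size, nxt = struct.unpack_from("<II", rom, off)
        sections.append((off, size))
        if nxt <= off:        # guard against cycles and corrupt chains
            break
        off = nxt
    return sections
```

The collected (offset, size) pairs could then feed a cheap domain-specific fingerprint, e.g. hashing the first few bytes of each section.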

If the hash functions don't get you all the way there (or if they require input of some sort to define a metric/distance), there are several binary delta algorithms and implementations available on the web. The one I'm most familiar with is used by the Subversion version control system. It uses a binary delta algorithm called xdelta to efficiently store binary file revisions. Here's a link directly to the file in their repository that implements it: xdelta.c. There's probably a tool on the web that makes this more accessible as well.
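A crude way to approximate "size of the binary delta" without pulling in xdelta itself is zlib's preset-dictionary mode. This is only a sketch of the idea, since zlib's 32 KB window limits how much of the base image it can reference:

```python
import hashlib
import zlib

def delta_size(base: bytes, target: bytes) -> int:
    """Crude stand-in for a binary delta such as xdelta: compress the
    target with the tail of the base as a preset dictionary. Clones of
    the base compress to almost nothing; unrelated data does not."""
    co = zlib.compressobj(level=9, zdict=base[-32768:])  # zlib window: 32 KB
    return len(co.compress(target) + co.flush())

# Stand-in ROMs: 4 KB of pseudo-random bytes, plus a lightly patched clone.
base = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
clone = base[:100] + b"PATCH" + base[105:]
```

The resulting size is exactly the kind of index described at the top of the answer: compare it against an experimentally chosen baseline to decide whether the target is a clone of the base.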
