How do you test if 2 large videos are identical?

Problem description

    I have a system where video files are ingested and then multiple CPU-intensive tasks are started. As these tasks are computationally expensive, I would like to skip processing a file if it has already been processed.

    Videos come from various sources, so file names etc. are not a viable option.

    If I were using pictures I would compare MD5 hashes, but on a 5 GB to 40 GB video a hash can take a long time to compute.

    To compare the 2 videos I am testing this method:

    • check relevant metadata matches
    • check length of file with ffmpeg / ffprobe
    • use ffmpeg to extract frames at 100 predefined timestamps [1-100]
    • create MD5 hashes of each of those frames
    • compare the MD5 hashes to check for a match

    Does anyone know a more efficient way of doing this? Or a better way to approach the problem?

Solution

    First, you need to properly define under which conditions two video files are considered the same. Do you mean exactly identical, as in byte-for-byte? Or do you mean identical in content? In the latter case you need to define a proper comparison method for the content.

    I'm assuming the first (exactly identical files). This is independent of what the files actually contain. When you receive a file, always build a hash for it and store the hash along with the file.
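
A minimal sketch of building such a hash at ingest time, assuming Python and SHA-256 (the answer only asks for "SHA1 or something bigger"). The function name `file_sha256` and the 1 MiB chunk size are illustrative choices, not something the answer prescribes.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in fixed-size chunks, so a 40 GB video never has
    to fit in memory. `chunk_size` is an assumed tuning knob.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest still costs one full read of the file, so on large videos the running time is dominated by disk IO rather than by the hash function itself.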

    Checking for duplicates then is a multi-step process:

    1.) Compare hashes. If you find no matching hash, the file is new. For a new file you can expect this to be the only step in most cases, since a good hash (SHA1 or something bigger) will have few collisions for any practical number of files.

    2.) If you found other files with the same hash, check the file length. If the lengths don't match, the file is new.

    3.) If both hash and file length match, you have to compare the entire file contents, stopping at the first difference. If the comparison reaches the end without finding one, the file is the same.
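
The three steps could be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: the `index` layout (a dict mapping hex digest to a list of stored paths) and the names `find_duplicate`, `same_bytes` and `file_hash` are hypothetical, and SHA-256 stands in for the "good hash".

```python
import hashlib
import os

CHUNK = 1024 * 1024  # assumed read size: 1 MiB


def file_hash(path):
    """Hex SHA-256 digest of a file, read chunk by chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b):
    """Compare two files chunk by chunk, stopping at the first difference."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(CHUNK)
            chunk_b = b.read(CHUNK)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files hit EOF together
                return True


def find_duplicate(new_path, index):
    """Return the path of a stored byte-identical file, or None.

    `index` maps hex digest -> list of stored paths (hypothetical layout).
    A file that turns out to be new is registered in `index`.
    """
    new_hash = file_hash(new_path)
    new_size = os.path.getsize(new_path)
    for stored in index.get(new_hash, []):          # step 1: hash lookup
        if os.path.getsize(stored) != new_size:     # step 2: length check
            continue
        if same_bytes(new_path, stored):            # step 3: full comparison
            return stored
    index.setdefault(new_hash, []).append(new_path)
    return None
```

In a real system the index would more likely live in a database table keyed by the digest; the dict here only keeps the sketch self-contained.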

    In the worst case (files are identical) this should take no longer than reading both files at raw IO speed. In the best case (hashes differ) the test takes only as long as the hash lookup (in a DB, a HashMap or whatever you use).

    EDIT: You are concerned about the IO needed to build the hash. You can partially avoid it by comparing the file length first and skipping everything else if the file length is unique. On the other hand, you then also need to keep track of which files you have already built a hash for. This allows you to defer building the hash until you really need it. If a hash is missing, you could skip directly to comparing the two files while building the hashes in the same pass. It's a lot more state to keep track of, but it may be worth it depending on your scenario (you need solid data on how often duplicate files occur and on their size distribution to make that decision).
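
One way the length-first idea might look, as a sketch under assumptions: the class name `LazyDedupIndex` and its in-memory dicts are hypothetical, SHA-256 is an assumed hash choice, and the final byte comparison uses the standard library's `filecmp.cmp` with `shallow=False`. The "compare and hash in the same pass" variant mentioned in the edit is not shown here.

```python
import filecmp
import hashlib
import os

CHUNK = 1024 * 1024  # assumed read size: 1 MiB


class LazyDedupIndex:
    """Tracks files by size and hashes a file only when its size collides."""

    def __init__(self):
        self.by_size = {}  # size in bytes -> list of registered paths
        self.hashes = {}   # path -> hex digest, filled in only when needed

    def _hash(self, path):
        # Build the digest on first use and remember it for later lookups.
        if path not in self.hashes:
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(CHUNK), b""):
                    digest.update(chunk)
            self.hashes[path] = digest.hexdigest()
        return self.hashes[path]

    def add(self, path):
        """Register `path`; return an earlier identical file's path or None."""
        size = os.path.getsize(path)
        peers = self.by_size.setdefault(size, [])
        if peers:
            # Only now is any file content read: another file has this size.
            new_hash = self._hash(path)
            for peer in peers:
                if self._hash(peer) != new_hash:
                    continue
                # Same size and same digest: confirm byte for byte.
                if filecmp.cmp(path, peer, shallow=False):
                    return peer
        peers.append(path)
        return None
```

With this layout a file whose size is unique costs a single `stat` call and no content reads, which is the saving the edit describes. Whether the extra bookkeeping pays off depends on how often sizes collide in the actual workload.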
