Compute entropy of different file extensions to find randomness of data?


Problem description

I have different file types, including JPEG/jpg, mp3, GIF, MP4, FLV, M4V, exe, zip, etc. I want to:

  1. Take data in blocks, something like a 4K block size, and find the randomness.

  2. Generate a randomness score between 0 and 1.

  3. Try to assign classes according to the randomness score.

How can we find the entropy of the different types of files mentioned above, and how can we scale each file's score between 0 and 1?

Answer

+1, very interesting question. Here are a few untested ideas, just off the top of my head:

  1. How about using a correlation coefficient with uniformly random data of the same size (4K)? And/or use an FFT first and then correlate... (see the sketch after this list)

  2. Or compute statistical properties of the data and somehow infer from those... I know this is a vague description, but it might be worth digging into.

  3. Use compression. For example, compress your 4K data with Huffman coding and infer your coefficient from the ratio between the uncompressed and compressed sizes, perhaps on a logarithmic scale...
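A literal, minimal sketch of the correlation idea (plain standard C++; pearson is a hypothetical helper name, and none of this comes from the original answer's code):

#include <math.h>
//---------------------------------------------------------------------------
// Pearson correlation coefficient between two equally sized byte blocks
double pearson(const unsigned char *a,const unsigned char *b,int n)
    {
    if ((n<=0)||(!a)||(!b)) return 0.0;
    double ma=0.0,mb=0.0,sa=0.0,sb=0.0,sab=0.0;
    for (int i=0;i<n;i++) { ma+=a[i]; mb+=b[i]; }
    ma/=n; mb/=n;                           // means of both blocks
    for (int i=0;i<n;i++)
        {
        double da=a[i]-ma,db=b[i]-mb;
        sa+=da*da; sb+=db*db; sab+=da*db;   // variance and covariance sums
        }
    if ((sa==0.0)||(sb==0.0)) return 0.0;   // constant block -> undefined, treat as 0
    return sab/sqrt(sa*sb);                 // result is in <-1..+1>
    }

You would fill a second 4096-byte reference block with rand() bytes and correlate it against your data block; note that for most inputs the result hovers near 0, which is exactly why these ideas are flagged as untested.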

I think the easiest to implement, with plausible results, will be the third approach, as Huffman coding and entropy are closely related.
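For the compression idea, a quick way to get a ratio without writing a Huffman coder is to use zlib's DEFLATE (which uses Huffman coding internally) as a stand-in. A minimal sketch, assuming zlib is available (link with -lz); the clamping to 1.0 is just one possible choice:

#include <vector>
#include <zlib.h>
//---------------------------------------------------------------------------
// randomness score in <0..1> from the compressed/uncompressed size ratio
double compression_score(const unsigned char *dat,unsigned long siz)
    {
    if ((siz==0)||(!dat)) return 0.0;
    uLongf clen=compressBound(siz);             // worst-case compressed size
    std::vector<Bytef> buf(clen);
    if (compress(buf.data(),&clen,dat,siz)!=Z_OK) return 0.0;
    double r=double(clen)/double(siz);          // small for redundant data, ~1 for random
    return (r>1.0)?1.0:r;                       // clamp: random data may even grow slightly
    }

If you want the logarithmic scale the answer mentions, you could take log2 of the ratio instead; the simple clamped ratio above already lands in <0..1>.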

[edit1] Using Shannon information entropy

Your suggestion is even better than Huffman encoding (even though the two are closely related). Using the Shannon information entropy H with binary digits as the base will return the average number of bits per word needed to represent your data (after Huffman encoding). So from there, to get a score in <0..1>, just divide by the number of bits per word...
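Spelled out for byte data (8-bit words), where p(i) is the relative frequency of byte value i within the block:

    H     = -sum( p(i) * log2 p(i) )    // average bits per byte, 0 <= H <= 8
    score = H / 8                       // normalized to <0..1>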

Here is a small C++/VCL example of computing the entropy on BYTEs:

//$$---- Form CPP ----
//---------------------------------------------------------------------------
#include <vcl.h>
#include <math.h>
#pragma hdrstop
#include "Unit1.h"
//---------------------------------------------------------------------------
#pragma package(smart_init)
#pragma resource "*.dfm"
TForm1 *Form1;
//---------------------------------------------------------------------------
char txt0[]="text Text bla bla bla ..."; 
char txt1[]="abcdefghijklmnopqrstuvwxy";
char txt2[]="AAAAAAAbbbbbccccddddeeeff";
//---------------------------------------------------------------------------
double entropy(BYTE *dat,int siz)
    {
    if ((siz<=0)||(dat==NULL)) return 0.0;
    int i; double H=0.0,P[256],dp=1.0/siz,base=1.0/log(2.0);
    for (i=0;i<256;i++) P[i]=0.0;
    for (i=0;i<siz;i++) P[dat[i]]+=dp;
    for (i=0;i<256;i++)
        {
        if (P[i]==0.0) continue;    // skip byte values not present in the data
        if (P[i]==1.0) return 0.0;  // only one byte value present -> zero entropy
        H-=P[i]*log(P[i])*base;     // Shannon entropy (base 2 -> bits)
        }
    return H;
    }
//---------------------------------------------------------------------------
__fastcall TForm1::TForm1(TComponent* Owner):TForm(Owner)
    {
    mm_log->Lines->Clear();
    mm_log->Lines->Add(AnsiString().sprintf("txt = \"%s\" , H = %.6lf , H/8 = %.6lf",txt0,entropy((BYTE*)txt0,sizeof(txt0)),entropy((BYTE*)txt0,sizeof(txt0))/8.0));
    mm_log->Lines->Add(AnsiString().sprintf("txt = \"%s\" , H = %.6lf , H/8 = %.6lf",txt1,entropy((BYTE*)txt1,sizeof(txt1)),entropy((BYTE*)txt1,sizeof(txt1))/8.0));
    mm_log->Lines->Add(AnsiString().sprintf("txt = \"%s\" , H = %.6lf , H/8 = %.6lf",txt2,entropy((BYTE*)txt2,sizeof(txt2)),entropy((BYTE*)txt2,sizeof(txt2))/8.0));
    }
//-------------------------------------------------------------------------

And the results:

txt = "text Text bla bla bla ..." , H = 3.185667 , H/8 = 0.398208
txt = "abcdefghijklmnopqrstuvwxy" , H = 4.700440 , H/8 = 0.587555
txt = "AAAAAAAbbbbbccccddddeeeff" , H = 2.622901 , H/8 = 0.327863
