Compressing unix timestamps with microseconds accuracy




I have file that consists of a sequence of real time unix timestamps with microsecond accuracy, i.e. the timestamps can never decrease. All the timestamps that need to be coded/decoded are from the same day. A sample entry in the file might be something like 1364281200.078739 that corresponds to 1364281200078739 usecs since epoch. Data is unevenly spaced and bounded.

I need to achieve a compression of around 10 bits/timestamp. Currently I am able to compress to average of 31 bits/timestamp by calculating difference between consecutive timestamps. How can I improve further ?

Edit:

We are calculating Degree of Compression as (Size of encoded file in bytes)/(Number of timestamps)*8. I divided the timestamps into two parts before '.' and after it. The integer part is quite constant and max difference between two integer part timestamps is 32 so I encoded it using 0-8 bits. The precision part is quite random so I have ignored the leading bits and wrote into file using 0-21 bits(as max it can be 999999). But the size of my encoded file is coming as 4007674 bytes and hence compression as 71.05 bits/TS. I also write '.' and a space between two timestamps to decode later. How can I improve upon my size of encoded file ?

Here is the link for partial data set - http://pastebin.com/QBs9Bqv0

Here is the link for differential timestamps value in micro-seconds - http://pastebin.com/3QJk1NDV Maximum difference b/w timestamps is - 32594136 micro sec.

Solution

If you take the interval between every timestamp and the previous one and express it in microseconds (i.e. as integers), the distribution of values per bit depth in your sample file is heavily skewed toward small values: 52.285% of the values are either 0 or 1, there are only a handful of other values below 64 (2~6 bits), 27.59% of the values are 7~12 bits, there's a fairly even distribution of around 2.1% per bit up to 20 bits, and only 3% above 20 bits, with a maximum of 25 bits. Looking at the data, it's also obvious that there are many sequences of up to 6 consecutive zeros.
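A per-bit-depth breakdown like the one above can be computed with a couple of small helpers (a sketch; the percentages quoted come from the author's sample file, not from this code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of bits needed to represent a value (0 -> 0 bits).
int bit_depth(uint64_t v) {
    int bits = 0;
    for (; v; v >>= 1) ++bits;
    return bits;
}

// counts[i] = number of deltas that need exactly i bits.
std::vector<std::size_t> depth_histogram(const std::vector<uint64_t>& deltas) {
    std::vector<std::size_t> counts(65, 0);
    for (uint64_t d : deltas) ++counts[bit_depth(d)];
    return counts;
}
```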

These observations gave me the idea of using a variable bit-size per value, something like this:

00 0xxxxx                             0 (xxxxx is the number of consecutive zeros)
00 1xxxxx                             1 (xxxxx is the number of consecutive ones)
01 xxxxxx xxxxxxxx                    2-14 bit values
10 xxxxxx xxxxxxxx xxxxxxxx           15-22 bit values
11 xxxxxx xxxxxxxx xxxxxxxx xxxxxxxx  23-30 bit values

A quick test showed that this resulted in a compression rate of 13.78 bits per timestamp, which isn't quite the 10 bits you were aiming for, but not a bad start for a simple scheme.
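The byte cost of this first scheme for a single interval of 2 or more can be sketched as follows (runs of 0s and 1s take one byte per run of up to 31 and are not covered here; the function name is made up for illustration):

```cpp
#include <cstdint>

// Encoded size in bytes for one interval >= 2 under the first scheme:
// 2 bytes hold 2-14 bit values, 3 bytes 15-22 bits, 4 bytes 23-30 bits.
int scheme1_bytes(uint32_t interval) {
    int bits = 0;
    for (uint32_t v = interval; v; v >>= 1) ++bits;
    if (bits <= 14) return 2;
    if (bits <= 22) return 3;
    return 4;
}
```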


After analysing the sample data some more, I observed that there are a lot of short sequences of consecutive 0's and 1's, like 0 1 0, so I replaced the 1-byte scheme with this one:

00xxxxxx    00     = identifies a one-byte value
            xxxxxx = index in the sequence table

The table of sequences:

index ~ seq    index ~ seq    index ~ seq    index ~ seq    index ~ seq     index ~ seq
  0      0       2     00       6     000     14    0000     30   00000     62   000000
  1      1       3     01       7     001     15    0001     31   00001     63   000001
                 4     10       8     010     16    0010     32   00010    
                 5     11         ...            ...           ...
                               11     101     27    1101     59   11101    
                               12     110     28    1110     60   11110    
                               13     111     29    1111     61   11111    
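The table above has a simple closed form, which is what the encoder's `seq += (1 << len) - 2` relies on: a run of length `len` whose bits (oldest first) form the value `v` gets index `(1 << len) - 2 + v`. A sketch of both directions (function names are made up for illustration):

```cpp
#include <cstdint>

// Index in the sequence table for a run of len (1..6) zero/one intervals
// whose bits, oldest first, form the value v. Lengths 1..6 occupy index
// ranges 0-1, 2-5, 6-13, 14-29, 30-61 and 62-63 (only v = 0 or 1 fits
// for len == 6, since the index field is 6 bits wide).
uint8_t seq_index(uint8_t len, uint8_t v) {
    return (1u << len) - 2 + v;
}

// Inverse mapping, as done in the decoder's write_sequence().
void seq_decode(uint8_t index, uint8_t& len, uint8_t& v) {
    len = 1;
    uint8_t offset = 1;
    while (index >= (offset <<= 1)) { index -= offset; ++len; }
    v = index;
}
```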

For the example file with 451,210 timestamps, this brings down the encoded file size to 676,418 bytes, or 11.99 bits per timestamp.

Testing the above method revealed that there were 98,578 single zeros and 31,271 single ones between larger intervals. So I tried using 1 bit of each larger interval to store whether it was followed by a zero, which decreased the encoded size to 592,315 bytes. And when I used 2 bits to store whether larger intervals were followed by 0, 1 or 00 (the most common sequence), the encoded size decreased to 564,034 bytes, or 10.0004 bits per timestamp.
I then changed to storing the single 0's and 1's with the following large interval instead of the preceding one (purely for reasons of code simplicity) and found that this resulted in a file size of 563,884 bytes, or 9.997722 bits per timestamp!

So the complete method is:

Store the first timestamp (8 bytes), then store the intervals as either: 

00 iiiiii                             sequences of up to 5 (or 6) zeros or ones
01 XXxxxx xxxxxxxx                     2-12 bit values (2 ~ 4,095)
10 XXxxxx xxxxxxxx xxxxxxxx           13-20 bit values (4,096 ~ 1,048,575)
11 XXxxxx xxxxxxxx xxxxxxxx xxxxxxxx  21-28 bit values (1,048,576 ~ 268,435,455)

iiiiii = index in sequence table (see above)
XX     = preceded by a zero (if XX=1), a one (if XX=2) or two zeros (if XX=3)
xxx... = 12, 20 or 28 bit value
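The leading byte of a multi-byte interval packs all three fields. Pulling them apart can be sketched like this (the struct and function names are made up for illustration, mirroring what the decoder below does inline):

```cpp
#include <cstdint>

// Fields of the first byte of a 2-4 byte interval: the top 2 bits give
// the number of extra bytes that follow (1-3), the next 2 bits are the
// XX prefix code, and the low 4 bits are the value's top nibble.
struct TagByte { uint8_t extra_bytes, xx, value_high; };

TagByte parse_tag(uint8_t b) {
    return TagByte{ uint8_t(b >> 6), uint8_t((b >> 4) & 3), uint8_t(b & 15) };
}
```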

Example of an encoder:

#include <stdint.h>
#include <iostream>
#include <fstream>
using namespace std;

void write_timestamp(ofstream& ofile, uint64_t timestamp) {    // big-endian
    uint8_t bytes[8];
    for (int i = 7; i >= 0; i--, timestamp >>= 8) bytes[i] = timestamp;
    ofile.write((char*) bytes, 8);
}

int main() {
    ifstream ifile ("timestamps.txt");
    if (! ifile.is_open()) return 1;
    ofstream ofile ("output.bin", ios::trunc | ios::binary);
    if (! ofile.is_open()) return 2;

    long double seconds;
    uint64_t timestamp;

    if (ifile >> seconds) {
        timestamp = seconds * 1000000;
        write_timestamp(ofile, timestamp);
    }

    while (! ifile.eof()) {
        uint8_t bytesize = 0, len = 0, seq = 0, bytes[4];
        uint32_t interval;

        while (bytesize == 0 && ifile >> seconds) {
            interval = seconds * 1000000 - timestamp;
            timestamp += interval;

            if (interval < 2) {
                seq <<= 1; seq |= interval;
                if ((++len == 5 && seq > 0) || len == 6) bytesize = 1;
            } else {
                while (interval >> ++bytesize * 8 + 4);
                for (uint8_t i = 0; i <= bytesize; i++) {
                    bytes[i] = interval >> (bytesize - i) * 8;
                }
                bytes[0] |= (bytesize++ << 6);
            }
        }
        if (len) {
            if (bytesize > 1 && (len == 1 || (len == 2 && seq == 0))) {
                bytes[0] |= (2 * len + seq - 1) << 4;
            } else {
                seq += (1 << len) - 2;
                ofile.write((char*) &seq, 1);
            }
        }
        if (bytesize > 1) ofile.write((char*) bytes, bytesize);
    }
    ifile.close();
    ofile.close();
    return 0;
}

Example of a decoder:

#include <stdint.h>
#include <iostream>
#include <fstream>
using namespace std;

uint64_t read_timestamp(ifstream& ifile) {    // big-endian
    uint64_t timestamp = 0;
    uint8_t byte;
    for (uint8_t i = 0; i < 8; i++) {
        ifile.read((char*) &byte, 1);
        if (ifile.fail()) return 0;
        timestamp <<= 8; timestamp |= byte;
    }
    return timestamp;
}

uint8_t read_interval(ifstream& ifile, uint8_t *bytes) {
    uint8_t bytesize = 1;
    ifile.read((char*) bytes, 1);
    if (ifile.fail()) return 0;
    bytesize += bytes[0] >> 6;
    for (uint8_t i = 1; i < bytesize; i++) {
        ifile.read((char*) bytes + i, 1);
        if (ifile.fail()) return 0;
    }
    return bytesize;
}

void write_seconds(ofstream& ofile, uint64_t timestamp) {
    long double seconds = (long double) timestamp / 1000000;
    ofile << seconds << "\n";
}

uint8_t write_sequence(ofstream& ofile, uint8_t seq, uint64_t timestamp) {
    uint8_t interval = 0, len = 1, offset = 1;
    while (seq >= (offset <<= 1)) {
        seq -= offset;
        ++len;
    }
    while (len--) {
        interval += (seq >> len) & 1;
        write_seconds(ofile, timestamp + interval);
    }
    return interval;
}

int main() {
    ifstream ifile ("timestamps.bin", ios::binary);
    if (! ifile.is_open()) return 1;
    ofstream ofile ("output.txt", ios::trunc);
    if (! ofile.is_open()) return 2;
    ofile.precision(6); ofile << std::fixed;

    uint64_t timestamp = read_timestamp(ifile);
    if (timestamp) write_seconds(ofile, timestamp);

    while (! ifile.eof()) {
        uint8_t bytes[4], seq = 0, bytesize = read_interval(ifile, bytes);
        uint32_t interval;

        if (bytesize == 1) {
            timestamp += write_sequence(ofile, bytes[0], timestamp);
        }
        else if (bytesize > 1) {
            seq = (bytes[0] >> 4) & 3;
            if (seq) timestamp += write_sequence(ofile, seq - 1, timestamp);
            interval = bytes[0] & 15;
            for (uint8_t i = 1; i < bytesize; i++) {
                interval <<= 8; interval += bytes[i];
            }
            timestamp += interval;
            write_seconds(ofile, timestamp);
        }
    }
    ifile.close();
    ofile.close();
    return 0;
}

Because of the long double output bug in the MinGW/gcc 4.8.1 compiler I'm using, I had to use this workaround: (this shouldn't be necessary with other compilers)

void write_seconds(ofstream& ofile, uint64_t timestamp) {
    long double seconds = (long double) timestamp / 1000000;
    ofile << "1" << (double) (seconds - 1000000000) << "\n";
}

Note to future readers: this method is based on analysis of an example data file; it will not give the same compression rate if your data is different.
