csv file compression without using existing libraries in Python


Problem description

I'm trying to compress a .csv file without using any 3rd party or framework provided compression libraries.

I have tried what I'd like to think is everything. I looked at Huffman coding, but since I'm not allowed to use that solution, I tried to do my own.

An example:

6NH8,F,A,0,60541567,60541567,78.78,20
6NH8,F,A,0,60541569,60541569,78.78,25
6AH8,F,B,0,60541765,60541765,90.52,1
QMH8,F,B,0,60437395,60437395,950.5,1

I made an algorithm that counts every character, tells me how many times each has been used, and, depending on that count, assigns each character a number.

',' --- 28  
'5' --- 18
'6' --- 17 
'0' --- 15
'7' --- 10  
'8' --- 8 
'4' --- 8  
'1' --- 8
'9' --- 6  
'.' --- 4
'3' --- 4 
'\n'--- 4 
'H' --- 4    
'F' --- 4
'2' --- 3 
'A' --- 3
'N' --- 2 
'B' --- 2 
'M' --- 1 
'Q' --- 1 

[(',', 0), ('5', 1), ('6', 2), ('0', 3), ('7', 4), ('8', 5), 
 ('4', 6), ('1', 7), ('9', 8), ('.', 9), ('3', 10), ('\n', 11), 
 ('H', 12), ('F', 13), ('2', 14), ('A', 15), ('N', 16), ('B', 17), 
 ('M', 18), ('Q', 19)]

So instead of storing for example ord('H') = 72, I give H the value 12, and so on.
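
Roughly, in Python (a sketch of the counting step; the file name is made up):

from collections import Counter

# Count how often each character occurs in the file.
with open("input.csv") as f:   # hypothetical path
    text = f.read()

counts = Counter(text)

# Most frequent character gets code 0, the next gets 1, and so on.
mapping = [(ch, code) for code, (ch, _) in enumerate(counts.most_common())]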

But when I change all the chars to my values, my generated csv (>40 MB) is still larger than the original (19 MB).

I even tried the alternative of dividing each list in two, i.e. turning one row into two rows:

[6NH8,F,A,0,]
[60541567,60541567,78.78,20]

But it was still larger, even larger than my "Huffman" version.

QUESTION: Anybody have any suggestions on how to 1. read a .csv file, 2. compress it without using a lib or 3rd party solution, and 3. generate and write a smaller .csv file?

For step 2, I'm not asking for a fully computed solution, just suggestions on how to minimize the file, e.g. by writing each value as one list, etc.

Thank you

Solution

It is unclear whether you need to create a generic compression algorithm or a custom one that works reasonably well for this kind of data.

It is also unclear whether the output should be another CSV, a string made of printable ASCII characters or plain binary data.

I'm going to assume that we're talking about a custom algorithm and a CSV output. (The same principles would apply to another output format anyway.)

It appears that your input is well formatted and always repeats the same kinds of fields:

0 '6NH8'     : 4-character code
1 'F'        : character
2 'A'        : character
3 '0'        : integer
4 '60541567' : integer \_ some kind of
5 '60541567' : integer /  timestamps?
6 '78.78'    : float
7 '20'       : integer

Building dictionaries

See how many distinct codes are used in column #0 and how many distinct combinations of 'column #1' + 'column #2' you have.

If the same values are used frequently, then it's definitely worth building dictionaries that will be stored only once and then referenced in the compressed rows.

For instance:

column0_dictionary = [ '6NH8', '6AH8', 'QMH8' ]
column12_dictionary = [ 'FA', 'FB' ]

So, 6NH8 would be referenced as 0, 6AH8 as 1, etc.

In the same way, F,A would be referenced as 0 and F,B as 1.
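
A minimal Python sketch of this dictionary-building step (the helper name and file path are illustrative, not part of the original scheme):

import csv

def build_dictionary(values):
    # Map each distinct value to a small integer, in first-seen order.
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return mapping

with open("input.csv", newline="") as f:   # hypothetical path
    rows = list(csv.reader(f))

column0 = build_dictionary(row[0] for row in rows)            # {'6NH8': 0, '6AH8': 1, 'QMH8': 2}
column12 = build_dictionary(row[1] + row[2] for row in rows)  # {'FA': 0, 'FB': 1}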

Encoding timestamps in a shorter format

Assuming that columns #4 and #5 are indeed timestamps, a quick win would be to store the minimum value and subtract it from the actual value in each compressed row.

minimum_timestamp = 60437395

Therefore, 60541569 becomes 60541569 - 60437395 = 104174.
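
In code, reusing rows from the sketch above:

timestamps = [int(row[4]) for row in rows]
minimum_timestamp = min(timestamps)                    # 60437395 for the sample data
deltas = [t - minimum_timestamp for t in timestamps]   # 60541569 -> 104174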

Example output

Here is what we get when applying these two simple methods to your example input:

# header
6NH8,6AH8,QMH8
FA,FB
60437395
# payload data
0,0,0,104172,104172,78.78,20
0,0,0,104174,104174,78.78,25
1,1,0,104370,104370,90.52,1
2,1,0,0,0,950.5,1
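
Putting both ideas together, here is a sketch that writes this exact format (it reuses build_dictionary from the sketch above; the header layout is just one possible choice):

def compress(rows, out_path):
    col0 = build_dictionary(row[0] for row in rows)
    col12 = build_dictionary(row[1] + row[2] for row in rows)
    base = min(int(row[4]) for row in rows)
    with open(out_path, "w") as out:
        # Header: both dictionaries (insertion order matches index order) and the minimum timestamp.
        out.write(",".join(col0) + "\n")
        out.write(",".join(col12) + "\n")
        out.write(str(base) + "\n")
        # Payload: one compressed line per input row.
        for r in rows:
            out.write("%d,%d,%s,%d,%d,%s,%s\n" % (
                col0[r[0]], col12[r[1] + r[2]], r[3],
                int(r[4]) - base, int(r[5]) - base, r[6], r[7]))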

You could also store in column #5 the difference between column #5 and column #4, if it turns out that they correspond to the 'start of something' and 'end of something'.

As is, the size of the compressed payload is about 70% of the size of the original input. (Keep in mind that the size of the header should become negligible when you have many more rows.)

Your example is too short to detect any other obvious patterns for the remaining fields, but hopefully these examples will give you some ideas.

UPDATE

It turns out that the timestamps are expressed as the number of milliseconds elapsed since midnight. So they are probably evenly distributed over 0-86399999, and subtracting a minimum won't gain much.

These numbers can however be encoded in a more compact manner than the ASCII representation of their decimal value.

The easiest way is to convert them to hexadecimal:

60541567 = 39BCA7F
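
In Python, with a round-trip check:

ts = 60541567
encoded = format(ts, "X")        # '39BCA7F' -- 7 characters instead of 8
assert int(encoded, 16) == ts    # decoding restores the original value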

A slightly more complicated way is to encode them in Base64:

  1. Convert the timestamp to its 4-byte representation (all values from 0 to 86399999 will fit in 4 bytes).

  2. Build a string made of the 4 corresponding characters and encode it in Base64.

For example:

60541567 = 03 9B CA 7F  # in hexadecimal and big-endian order

BASE64(CHR(0x03) + CHR(0x9B) + CHR(0xCA) + CHR(0x7F)) = A5vKfw
# here without the padding characters
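
A minimal Python sketch of this encoding (the padding is stripped, as in the example; it must be re-added before decoding):

import base64
import struct

def encode_timestamp(ts):
    packed = struct.pack(">I", ts)   # 4 bytes, big-endian: b'\x03\x9b\xca\x7f'
    return base64.b64encode(packed).rstrip(b"=").decode("ascii")

print(encode_timestamp(60541567))    # -> A5vKfw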
