Decompress and read Dukascopy .bi5 tick files
Question
To cut a long story short, I need to open a .bi5 file and read its contents. The problem: I have tens of thousands of .bi5 files containing time-series data that I need to decompress and process (read, dump into pandas).
I ended up installing Python 3 (I normally use 2.7) specifically for the lzma library, because compiling the lzma backports for Python 2.7 was a nightmare. So I conceded and went with Python 3, but without success. The problems are too numerous to list here; no one reads long questions!
I have included one of the .bi5 files. If someone could get it into a pandas DataFrame and show me how they did it, that would be ideal.
P.S. The file is only a few KB, so it will download in a second. Thanks very much in advance.
(File) http://www.filedropper.com/13hticks
Recommended answer
The code below should do the trick. It opens the file, decompresses it with lzma, and then uses struct to unpack the binary data.
import lzma
import struct
import pandas as pd

def bi5_to_df(filename, fmt):
    # Size in bytes of one record described by fmt
    chunk_size = struct.calcsize(fmt)
    data = []
    # lzma.open decompresses transparently while reading
    with lzma.open(filename) as f:
        while True:
            chunk = f.read(chunk_size)
            if chunk:
                data.append(struct.unpack(fmt, chunk))
            else:
                break  # end of file
    df = pd.DataFrame(data)
    return df
The most important thing is to know the right format. I googled around, tried a few guesses, and '>3i2f' (or '>3I2f') works quite well: big-endian, three ints followed by two floats. (What you suggested, 'i4f', doesn't produce sensible floats, regardless of whether it is read as big- or little-endian.) For struct and its format syntax, see the docs.
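To make the format string concrete, here is a minimal sketch that packs and unpacks one record with '>3I2f'. The tick values are simply copied from the sample output in this answer for illustration; nothing here reads a real file.

```python
import struct

FMT = ">3I2f"  # big-endian: three unsigned 32-bit ints, then two 32-bit floats

# With an explicit byte order there is no padding: 3*4 + 2*4 = 20 bytes per record.
record_size = struct.calcsize(FMT)

# Pack one illustrative record, then unpack it again with the same format.
raw = struct.pack(FMT, 3581, 131966, 131945, 1.5, 1.5)
fields = struct.unpack(FMT, raw)

# Reading the same bytes with the wrong byte order gives different integers.
wrong = struct.unpack("<3I2f", raw)

print(record_size, fields)
```

Because every record has the same fixed size, reading the file in chunks of `struct.calcsize(fmt)` bytes, as bi5_to_df does, yields exactly one tick per chunk.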
df = bi5_to_df('13h_ticks.bi5', '>3i2f')
df.head()
Out[177]:
      0       1       2     3     4
0   210  110218  110216  1.87  1.12
1   362  110219  110216  1.00  5.85
2   875  110220  110217  1.00  1.12
3  1408  110220  110218  1.50  1.00
4  1884  110221  110219  3.94  1.00
Update
To compare the output of bi5_to_df with https://github.com/ninety47/dukascopy, I compiled and ran test_read_bi5 from there. The first lines of the output are:
time, bid, bid_vol, ask, ask_vol
2012-Dec-03 01:00:03.581000, 131.945, 1.5, 131.966, 1.5
2012-Dec-03 01:00:05.142000, 131.943, 1.5, 131.964, 1.5
2012-Dec-03 01:00:05.202000, 131.943, 1.5, 131.964, 2.25
2012-Dec-03 01:00:05.321000, 131.944, 1.5, 131.964, 1.5
2012-Dec-03 01:00:05.441000, 131.944, 1.5, 131.964, 1.5
And bi5_to_df on the same input file gives:
bi5_to_df('01h_ticks.bi5', '>3I2f').head()
Out[295]:
      0       1       2     3    4
0  3581  131966  131945  1.50  1.5
1  5142  131964  131943  1.50  1.5
2  5202  131964  131943  2.25  1.5
3  5321  131964  131944  1.50  1.5
4  5441  131964  131944  1.50  1.5
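So everything seems to be fine (ninety47's code reorders the columns). Comparing the two outputs, the raw columns are the millisecond offset within the hour, ask, bid, ask volume, and bid volume, with prices stored as integers (131945 corresponds to 131.945).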
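The comparison above suggests how the raw columns could be turned into a labeled frame. The following is only a sketch: `label_ticks`, its column names, and the `point` divisor are hypothetical, and the divisor of 1000 is inferred solely from this one file (131945 versus 131.945), so other instruments may need a different value. The start of the hour is not stored in the records shown here, so the caller has to supply it.

```python
import pandas as pd

def label_ticks(df, hour_start, point=1000.0):
    """Name the raw columns and convert them to readable units.

    Assumptions (not confirmed by any official format description):
    column order is ms offset, ask, bid, ask volume, bid volume, and
    integer prices are divided by `point` to get decimal prices.
    """
    out = df.copy()
    out.columns = ["ms", "ask", "bid", "ask_vol", "bid_vol"]
    # The first column is a millisecond offset from the start of the hour.
    out["time"] = pd.Timestamp(hour_start) + pd.to_timedelta(out["ms"], unit="ms")
    out["ask"] = out["ask"] / point
    out["bid"] = out["bid"] / point
    # Same column order as the ninety47 output.
    return out[["time", "bid", "bid_vol", "ask", "ask_vol"]]

# Two rows copied from the bi5_to_df output above.
raw = pd.DataFrame([(3581, 131966, 131945, 1.50, 1.5),
                    (5202, 131964, 131943, 2.25, 1.5)])
ticks = label_ticks(raw, "2012-12-03 01:00:00")
print(ticks)
```

With the hour start taken from the ninety47 output (2012-Dec-03 01:00), the first row lines up with its first line: 01:00:03.581, bid 131.945, ask 131.966.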
Also, it's probably more accurate to use '>3I2f' instead of '>3i2f' (i.e. unsigned int instead of int).
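Since the question mentions tens of thousands of files, a variant that decompresses each file in one read and unpacks all records with `struct.iter_unpack` may be worth trying. This is a sketch, not a benchmarked replacement: `bi5_to_df_bulk` is a hypothetical name, and silently dropping a truncated trailing record is a design choice of this example rather than something the format requires. The self-check uses synthetic data, not a real Dukascopy file.

```python
import lzma
import os
import struct
import tempfile

import pandas as pd

def bi5_to_df_bulk(filename, fmt=">3I2f"):
    """Decompress the whole file once, then unpack every fixed-size record."""
    record_size = struct.calcsize(fmt)
    with lzma.open(filename) as f:
        payload = f.read()
    # iter_unpack needs a buffer whose length is a multiple of the record size,
    # so cut off any incomplete trailing record first.
    usable = len(payload) - len(payload) % record_size
    return pd.DataFrame(list(struct.iter_unpack(fmt, payload[:usable])))

# Self-check on synthetic data: two records plus three stray bytes.
records = [(3581, 131966, 131945, 1.5, 1.5), (5142, 131964, 131943, 1.5, 1.5)]
blob = b"".join(struct.pack(">3I2f", *r) for r in records) + b"\x00\x00\x00"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "synthetic.bi5")
    # Written as an .xz container here; lzma.open auto-detects the container on read.
    with lzma.open(path, "wb") as f:
        f.write(blob)
    df = bi5_to_df_bulk(path)
print(df)
```

For many files, the resulting frames could be collected in a list and combined with `pd.concat`.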