numpy: fromfile for gzipped file

Problem description

I am using numpy.fromfile to construct an array that I can pass to the pandas.DataFrame constructor:

import numpy as np
import pandas as pd

def read_best_file(file, **kwargs):
    '''
    Loads best price data into a dataframe
    '''
    names   = [ 'time', 'bid_size', 'bid_price', 'ask_size', 'ask_price' ]
    formats = [ 'u8',   'i4',       'f8',        'i4',       'f8'        ]
    offsets = [  0,      8,          12,          20,         24         ]

    dt = np.dtype({
            'names': names, 
            'formats': formats,
            'offsets': offsets 
        })
    return pd.DataFrame(np.fromfile(file, dt))

I would like to extend this method to work with gzipped files.

According to the numpy.fromfile documentation, the first parameter is file:

file : file or str
Open file object or filename

As such, I added the following to check for a gzip file path:

import gzip

if isinstance(file, str) and file.endswith(".gz"):
    file = gzip.open(file, "r")

However, when I try to pass this object to fromfile, I get an IOError:

IOError: first argument must be an open file

Question:

How can I call numpy.fromfile with a gzipped file?

As requested in the comments, here is the implementation that checks for gzipped files:

def read_best_file(file, **kwargs):
    '''
    Loads best price data into a dataframe
    '''
    names   = [ 'time', 'bid_size', 'bid_price', 'ask_size', 'ask_price' ]
    formats = [ 'u8',   'i4',       'f8',        'i4',       'f8'        ]
    offsets = [  0,      8,          12,          20,         24         ]

    dt = np.dtype({
            'names': names, 
            'formats': formats,
            'offsets': offsets 
        })

    if isinstance(file, str) and file.endswith(".gz"):
        file = gzip.open(file, "r")

    return pd.DataFrame(np.fromfile(file, dt))

Recommended answer

gzip.open() doesn't return a true file object. It returns a duck-typed one: it walks like a duck and sounds like a duck, but isn't quite a duck as far as numpy is concerned. numpy is being strict here (since much of it is written in lower-level C code, it might require an actual file descriptor).
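To make the point about the wrapper object concrete, here is a small sketch that writes a throwaway gzip file and inspects what gzip.open() hands back. The file name sample.bin.gz and its contents are made up for illustration, and the snippet assumes Python 3.

```python
import gzip
import os
import tempfile

# Write a tiny gzip file to a temporary directory (hypothetical sample data).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.bin.gz")
with gzip.open(sample_path, "wb") as f:
    f.write(b"hello world\n")

with gzip.open(sample_path, "rb") as gz:
    # gzip.open returns a GzipFile wrapper rather than a plain OS-level file.
    wrapper_type = type(gz)
    # Reading through the wrapper yields the decompressed bytes.
    payload = gz.read()

print(wrapper_type.__name__)
print(payload)
```

The wrapper decompresses in Python as you read from it, so there is no operating-system file whose raw bytes are the decompressed data, which is what a C-level reader would need.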

You can get the underlying file from the gzip.open() call, but that's just going to get you the compressed stream.

This is what I would do: I would use subprocess.Popen() to invoke zcat to uncompress the file as a stream.

>>> import subprocess
>>> p = subprocess.Popen(["/usr/bin/zcat", "foo.txt.gz"], stdout=subprocess.PIPE)
>>> type(p.stdout)
<type 'file'>
>>> p.stdout.read()
'hello world\n'

Now you can pass p.stdout as a file object to numpy:

np.fromfile(p.stdout, ...)
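If zcat is not available, or if np.fromfile will not accept a pipe on a given Python and NumPy version, one alternative that is not part of the answer above is to decompress the file in memory with the gzip module and parse the bytes with np.frombuffer. The sketch below assumes the file fits in memory and that its decompressed size is a whole number of 32-byte records; the function name read_best_array is hypothetical.

```python
import gzip

import numpy as np

# Record layout copied from the question (32 bytes per record).
BEST_DTYPE = np.dtype({
    "names":   ["time", "bid_size", "bid_price", "ask_size", "ask_price"],
    "formats": ["u8",   "i4",       "f8",        "i4",       "f8"],
    "offsets": [0,      8,          12,          20,         24],
})


def read_best_array(path):
    """Return the records in `path` as a structured array.

    Hypothetical helper: paths ending in ".gz" are decompressed in memory,
    anything else is handed to np.fromfile unchanged.
    """
    if isinstance(path, str) and path.endswith(".gz"):
        with gzip.open(path, "rb") as f:
            raw = f.read()  # whole decompressed payload as bytes
        # frombuffer returns a read-only view of `raw`; copy() makes it writable.
        # It raises ValueError if len(raw) is not a multiple of the record size.
        return np.frombuffer(raw, dtype=BEST_DTYPE).copy()
    return np.fromfile(path, dtype=BEST_DTYPE)
```

The returned array can be passed to pd.DataFrame in the same way as the result of np.fromfile in the question.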
