How to pipe binary data into numpy arrays without tmp storage?

Question

There are several similar questions but none of them answers this simple question directly:

How can I capture a command's output and stream it into numpy arrays without creating a temporary string object to read from?

So, what I would like to do is this:

import subprocess
import numpy
import StringIO

def parse_header(fileobject):
    # this function moves the filepointer and returns a dictionary
    d = do_some_parsing(fileobject)
    return d

sio = StringIO.StringIO(subprocess.check_output(cmd))
d = parse_header(sio)
# now the file pointer is at the start of data, parse_header takes care of that.
# ALL of the data is now available in the next line of sio
dt = numpy.dtype([(key, 'f8') for key in d.keys()])

# I don't know how to make this work:
data = numpy.fromxxxx(sio, dt)

# If I do this, I create another copy besides the StringIO object, don't I?
# So this works, but isn't it 'bad'?
datastring = sio.read()
data = numpy.fromstring(datastring, dtype=dt)

I tried both StringIO and cStringIO, but neither is accepted by numpy.frombuffer or numpy.fromfile.

Using a StringIO object, I first have to read the stream into a string and then use numpy.fromstring, but I would like to avoid creating the intermediate object (several gigabytes).

An alternative for me would be to stream sys.stdin into numpy arrays, but that does not work with numpy.fromfile either (it requires seek to be implemented).

Are there any work-arounds for this? I can't be the first one trying this (unless this is a PEBKAC case?)

Solution: This is the current solution. It combines unutbu's instructions on using Popen with PIPE and eryksun's hint to use bytearray, so I don't know whose answer to accept! :S

import subprocess as sp
import numpy as np

proc = sp.Popen(cmd, stdout=sp.PIPE, shell=True)
d = parse_des_header(proc.stdout)  # reads the header; the stream is then positioned at the data
rec_dtype = np.dtype([(key, 'f8') for key in d.keys()])
data = bytearray(proc.stdout.read())
ndata = np.frombuffer(data, dtype=rec_dtype)

I didn't check whether this really avoids creating another copy of the data; I don't know how to. But I noticed that it works much faster than everything I tried before, so many thanks to the authors of both answers!

Recommended answer

You can use Popen with stdout=subprocess.PIPE. Read in the header, then load the rest into a bytearray to use with np.frombuffer.
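
A minimal, self-contained sketch of this approach, assuming Python 3 and a made-up stream format (one text header line of comma-separated field names, followed by raw float64 records). The child process, parse_header, and the format itself are illustrative stand-ins, not the asker's actual tool:

```python
import subprocess
import sys

import numpy as np

# Hypothetical producer: writes a one-line text header, then 3 records of 2 float64 each.
CHILD = (
    "import sys, struct\n"
    "out = sys.stdout.buffer\n"
    "out.write(b'x,y\\n')\n"
    "out.write(struct.pack('=6d', 1, 2, 3, 4, 5, 6))\n"
)

def parse_header(stream):
    """Consume the header line and return the field names (illustrative format)."""
    return stream.readline().decode("ascii").strip().split(",")

def read_records(cmd):
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    with proc.stdout:
        names = parse_header(proc.stdout)
        dt = np.dtype([(name, "f8") for name in names])
        # read() still builds one bytes object; bytearray() copies it into a mutable buffer.
        data = bytearray(proc.stdout.read())
    proc.wait()
    # frombuffer wraps the bytearray without copying it.
    return np.frombuffer(data, dtype=dt)

records = read_records([sys.executable, "-c", CHILD])
print(records["x"], records["y"])
```

Note that this version still materializes the payload once via read(); it only avoids the second copy that fromstring would make.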

Additional comments based on your edit:

If you're going to call proc.stdout.read(), it's equivalent to using check_output(). Both create a temporary string. If you preallocate data, you could use proc.stdout.readinto(data). Then if the number of bytes read into data is less than len(data), free the excess memory, else extend data by whatever is left to be read.

data = bytearray(2**32)  # 4 GiB
n = proc.stdout.readinto(data)
if n < len(data):
    del data[n:]  # drop the unused tail
else:
    data += proc.stdout.read()  # buffer was filled; append whatever is left

You could also come at this starting with a pre-allocated ndarray ndata and use buf = np.getbuffer(ndata) (Python 2 only; on Python 3, memoryview(ndata) serves the same purpose). Then readinto(buf) as above.
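
As a rough Python 3 sketch of the preallocated-array variant: if the number of values is known in advance (hard-coded here as an assumption), readinto can fill the array's memory directly. The child command and read_into_array are hypothetical names for illustration. Because a single readinto call is not guaranteed to fill the whole buffer, the loop keeps reading until the buffer is full or the stream ends:

```python
import subprocess
import sys

import numpy as np

# Hypothetical producer: emits 1000 float64 values (0.0 .. 999.0) as raw bytes.
CHILD = (
    "import sys, array\n"
    "sys.stdout.buffer.write(array.array('d', range(1000)).tobytes())\n"
)

def read_into_array(stream, count, dtype="f8"):
    """Fill a preallocated array from a binary stream without an intermediate bytes object."""
    out = np.empty(count, dtype=dtype)
    view = memoryview(out).cast("B")  # flat, writable byte view over the array's memory
    filled = 0
    while filled < len(view):
        n = stream.readinto(view[filled:])
        if not n:  # nothing more to read
            break
        filled += n
    # Keep only the complete items that were actually read.
    return out[: filled // out.itemsize]

proc = subprocess.Popen([sys.executable, "-c", CHILD], stdout=subprocess.PIPE)
with proc.stdout:
    ndata = read_into_array(proc.stdout, 1000)
proc.wait()
print(ndata[:3], ndata[-1])
```

If the stream turns out shorter than expected, the returned slice is a view of the filled prefix, so the unused tail of the allocation stays alive until the array is released.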

Here's an example to show that the memory is shared between the bytearray and the np.ndarray:

>>> data = bytearray(b'\x01')
>>> ndata = np.frombuffer(data, np.int8)
>>> ndata
array([1], dtype=int8)
>>> ndata[0] = 2
>>> data
bytearray(b'\x02')
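
For a programmatic check rather than a REPL session, here is a small sketch (Python 3 assumed) that uses the array's owndata flag and np.shares_memory to confirm no copy was made. The buffer size and variable names are arbitrary choices for illustration:

```python
import numpy as np

data = bytearray(8 * 4)  # room for four float64 values, all zero bytes
ndata = np.frombuffer(data, dtype="f8")

# The array does not own its memory; it borrows the bytearray's buffer.
owns = ndata.flags.owndata
print(owns)

# A second view over the same bytearray overlaps the first one in memory.
raw = np.frombuffer(data, dtype=np.uint8)
shared = np.shares_memory(ndata, raw)
print(shared)

# A write through the array shows up in the bytearray.
ndata[0] = 1.0
changed = any(data)
print(changed)
```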
