How to pipe binary data into numpy arrays without tmp storage?
Question
There are several similar questions, but none of them answers this simple question directly:
How can I capture a command's output and stream that content into numpy arrays without creating a temporary string object to read from?
So, what I would like to do is this:
import subprocess
import numpy
import StringIO

def parse_header(fileobject):
    # this function moves the file pointer and returns a dictionary
    d = do_some_parsing(fileobject)
    return d

sio = StringIO.StringIO(subprocess.check_output(cmd))
d = parse_header(sio)
# now the file pointer is at the start of the data; parse_header takes care of that.
# ALL of the data is now available in the next line of sio
dt = numpy.dtype([(key, 'f8') for key in d.keys()])

# I don't know how to make this work:
data = numpy.fromxxxx(sio, dt)

# if I do this, I create another copy besides the StringIO object, don't I?
# so this works, but isn't this 'bad'?
datastring = sio.read()
data = numpy.fromstring(datastring, dtype=dt)
I tried both StringIO and cStringIO, but neither numpy.frombuffer nor numpy.fromfile accepts them.
With a StringIO object I first have to read the stream into a string and then use numpy.fromstring, but I would like to avoid creating that intermediate object (several gigabytes).
An alternative for me would be to stream sys.stdin into numpy arrays, but that does not work with numpy.fromfile either (it needs seek to be implemented).
Are there any workarounds for this? I can't be the first one trying this (unless this is a PEBKAC case?).
Solution: This is the current solution. It combines unutbu's instructions on how to use Popen with PIPE and eryksun's hint to use bytearray, so I don't know whose answer to accept! :S
import subprocess as sp
import numpy as np

proc = sp.Popen(cmd, stdout=sp.PIPE, shell=True)
d = parse_des_header(proc.stdout)  # consumes the header, leaving the pipe at the data
rec_dtype = np.dtype([(key, 'f8') for key in d.keys()])
data = bytearray(proc.stdout.read())
ndata = np.frombuffer(data, dtype=rec_dtype)
I didn't check whether this really avoids creating another copy of the data; I don't know how to. But I noticed that it works much faster than everything I tried before, so many thanks to the authors of both answers!
Recommended answer
You can use Popen with stdout=subprocess.PIPE. Read in the header, then load the rest into a bytearray to use with np.frombuffer.
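The approach above can be sketched end to end as follows. This is only a minimal illustration under stated assumptions: the child command (CHILD), its one-line text header, and the parse_header stand-in are all hypothetical and exist only to make the example self-contained; the real command and header format come from your own setup.

```python
import subprocess
import sys

import numpy as np

# Hypothetical child process: writes a one-line text header naming the fields,
# then three records of two native-endian float64 values each as raw bytes.
CHILD = (
    "import sys, struct;"
    "out = sys.stdout.buffer;"
    "out.write(b'x y\\n');"
    "out.write(struct.pack('=6d', 1, 2, 3, 4, 5, 6))"
)


def parse_header(fileobject):
    """Hypothetical stand-in for the question's parser: reads one header line."""
    return fileobject.readline().decode("ascii").split()


proc = subprocess.Popen([sys.executable, "-c", CHILD], stdout=subprocess.PIPE)
keys = parse_header(proc.stdout)          # leaves the pipe positioned at the data
rec_dtype = np.dtype([(key, "f8") for key in keys])

data = bytearray(proc.stdout.read())      # read() still builds a temporary bytes object
proc.stdout.close()
proc.wait()

ndata = np.frombuffer(data, dtype=rec_dtype)  # a view on the bytearray, no further copy
```

Because the array is built on a bytearray rather than an immutable bytes object, ndata is writable, and fields can be accessed by name, for example ndata['x'].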
Additional comments based on your edit:
If you're going to call proc.stdout.read(), it's equivalent to using check_output(): both create a temporary string. If you preallocate data, you can use proc.stdout.readinto(data) instead. Then, if the number of bytes read into data is less than len(data), free the excess memory; otherwise, extend data by whatever is left to be read.
data = bytearray(2**32)  # 4 GiB
n = proc.stdout.readinto(data)
if n < len(data):
    del data[n:]  # free the unused tail
else:
    data += proc.stdout.read()  # buffer was full: append whatever is left
You could also come at this starting with a pre-allocated ndarray ndata and use buf = np.getbuffer(ndata). Then call readinto(buf) as above.
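A sketch of the pre-allocated ndarray variant on Python 3 follows. np.getbuffer only exists in Python 2 builds of NumPy, so this example swaps in a memoryview over a uint8 view of the array, which serves the same purpose. The child command (CHILD) is hypothetical, and the example assumes the number of values is known before reading.

```python
import subprocess
import sys

import numpy as np

# Hypothetical child process that writes four native-endian float64 values.
CHILD = (
    "import sys, struct;"
    "sys.stdout.buffer.write(struct.pack('=4d', 1.5, 2.5, 3.5, 4.5))"
)

ndata = np.empty(4, dtype="f8")            # assumes the value count is known up front
buf = memoryview(ndata.view(np.uint8))     # writable byte view of the array's memory

proc = subprocess.Popen([sys.executable, "-c", CHILD], stdout=subprocess.PIPE)
filled = 0
while filled < len(buf):
    n = proc.stdout.readinto(buf[filled:])  # writes straight into ndata's memory
    if not n:                               # nothing read: the child closed the pipe early
        break
    filled += n
proc.stdout.close()
proc.wait()
```

The loop is a precaution: it keeps calling readinto until the array is full or the pipe reaches end of file, so a short read does not leave the tail of ndata uninitialized without being noticed. Comparing filled with len(buf) afterwards tells you whether the child delivered as much data as expected.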
Here's an example showing that the memory is shared between the bytearray and the np.ndarray:
>>> import numpy as np
>>> data = bytearray(b'\x01')
>>> ndata = np.frombuffer(data, np.int8)
>>> ndata
array([1], dtype=int8)
>>> ndata[0] = 2
>>> data
bytearray(b'\x02')
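For checking from a script rather than the interpreter whether np.frombuffer made a copy, one possible approach is to inspect the array's owndata flag and to ask NumPy directly with np.shares_memory. The variable names below are illustrative only.

```python
import numpy as np

data = bytearray(b"\x01\x02\x03\x04")
ndata = np.frombuffer(data, dtype=np.int8)

# owndata is False when the array borrows memory from another object.
owns_memory = bool(ndata.flags.owndata)

# A write through the bytearray is visible through the array.
data[0] = 9
first = int(ndata[0])

# Two arrays built on the same bytearray overlap in memory.
same_block = np.shares_memory(ndata, np.frombuffer(data, dtype=np.int8))
```

If owns_memory is False and same_block is True, the array is a view on the bytearray and no second copy of the data exists.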