Reading really big blobs without downloading them in Google Cloud (streaming?)
Question
Please help!
[+] What I have: A lot of blobs in every bucket. The blobs vary in size from less than a kilobyte to many gigabytes.
[+] What I'm trying to do: I need to be able to either stream the data in those blobs (through a buffer of size 1024 or something like that) or read them in chunks of a certain size in Python. The point is that I don't think I can just do a bucket.get_blob(), because if the blob were a terabyte I wouldn't be able to hold it in physical memory.
[+] What I'm really trying to do: parse the information inside the blobs to identify keywords.
[+] What I've read: A lot of documentation on how to write to Google Cloud in chunks and then use compose to stitch the pieces together (not helpful at all)
A lot of documentation on Java's prefetch functions (this needs to be in Python)
The Google Cloud APIs
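To make the chunked-read goal concrete, here is a rough sketch of what I'm after. The byte_ranges helper is hypothetical; the GCS part assumes google-cloud-storage's Blob.download_as_bytes, which accepts optional start/end byte offsets (end inclusive, per HTTP Range semantics), so each call would fetch one bounded slice instead of the whole object:

```python
def byte_ranges(total_size, chunk_size):
    """Split [0, total_size) into inclusive (start, end) byte ranges
    of at most chunk_size bytes, matching HTTP Range semantics."""
    return [(start, min(start + chunk_size, total_size) - 1)
            for start in range(0, total_size, chunk_size)]

# Hypothetical usage against GCS (untested assumption on my part):
# from google.cloud import storage
# blob = storage.Client().bucket("my-bucket").get_blob("huge-object")
# for start, end in byte_ranges(blob.size, 1024 * 1024):
#     chunk = blob.download_as_bytes(start=start, end=end)
#     ...  # scan the chunk, then let it be garbage-collected
```

This way only one chunk is ever resident in memory, at the cost of one request per range.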
If anyone could point me in the right direction I would be really grateful! Thanks
Answer
So a way I have found of doing this is to create a file-like object in Python and then use the Google Cloud API call .download_to_file() with that file-like object.
This essentially streams the data. The Python code looks something like this:
import os

def getStream(blob):
    # built-in open()'s third argument is a buffering size, not O_* flags,
    # so O_NONBLOCK has to go through os.open instead
    fd = os.open('myStream', os.O_WRONLY | os.O_CREAT | os.O_NONBLOCK)
    with os.fdopen(fd, 'wb') as stream:
        blob.download_to_file(stream)
The os.O_NONBLOCK flag is there so I can read from the file while it is being written. I still haven't tested this with really big files, so if anyone knows a better implementation or sees a potential failure with this, please comment. Thanks!
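As a possible alternative, here is a sketch that skips the temp file entirely (with assumptions: newer releases of google-cloud-storage expose Blob.open("rb"), which returns a lazily-fetching file-like object, and the find_keywords helper below is my own invention, not part of any library). It scans the stream chunk by chunk and keeps a small overlap between chunks so a keyword split across a chunk boundary is still found:

```python
import io

def find_keywords(fileobj, keywords, chunk_size=1024 * 1024):
    """Scan a binary file-like object chunk by chunk and return the
    subset of `keywords` (bytes) found anywhere in the stream.
    An overlap of (longest keyword length - 1) bytes is carried
    between chunks so boundary-spanning matches are not missed."""
    found = set()
    overlap = max(len(k) for k in keywords) - 1
    tail = b""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        window = tail + chunk  # previous tail + current chunk
        for kw in keywords:
            if kw not in found and kw in window:
                found.add(kw)
        tail = window[-overlap:] if overlap else b""
    return found

# Hypothetical GCS usage -- blob.open("rb") streams the object lazily:
# from google.cloud import storage
# blob = storage.Client().bucket("my-bucket").blob("huge-object")
# with blob.open("rb") as f:
#     hits = find_keywords(f, [b"error", b"timeout"])

# Local demonstration with an in-memory stream:
data = io.BytesIO(b"x" * 2000 + b"error" + b"y" * 2000)
hits = find_keywords(data, [b"error", b"timeout"], chunk_size=512)
```

Because find_keywords only needs .read(), it works on any binary file-like object, so it can be exercised locally with io.BytesIO before pointing it at a real blob.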