Export NDB Datastore records to Cloud Storage CSV file
Question
In my NDB Datastore I have more than 2 million records. I want to export these records, grouped by created_at date, into CSV files on Google Cloud Storage. I calculated that each file would then be about 1GB.
2014-03-18.csv, ~17000 records, ~1GB
2014-03-17.csv, ~17000 records, ~1GB
2014-03-16.csv, ~17000 records, ~1GB
...
My first approach (pseudo-code):
import cloudstorage as gcs
from datetime import timedelta

# GCS paths must start with the bucket name
gcs_file = gcs.open('/bucket-name/' + date.strftime('%Y-%m-%d') + '.csv', 'w')
query = Item.query().filter(Item.created_at >= date).filter(Item.created_at < date + timedelta(days=1))
records, cursor, more = query.fetch_page(50)  # continue paging with start_cursor=cursor
for record in records:
    gcs_file.write(str(record))  # serialize the entity to a CSV row here
gcs_file.close()
But this (obviously?) leads to memory issues:
Error: Exceeded soft private memory limit with 622.16 MB after servicing 2 requests total
Should I use a MapReduce Pipeline instead, or is there any way to make approach 1 work? If using MapReduce: could I filter on created_at without iterating over all records in NDB?
Solution

I finally figured it out. Since all the data lives in the NDB datastore, I wasn't really able to test everything locally, so logging.info("Memory Usage: %s", runtime.memory_usage().current()) turned out to be extremely helpful (import with from google.appengine.api import runtime).
The problem is the "in-context cache": query results are written back to the in-context cache, so every fetched entity stays in memory for the lifetime of the request. The fix is to disable the in-context cache for the entity kind being exported.
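A minimal sketch of disabling that cache (assuming the App Engine ndb library; the Item model and created_at property are taken from the question). It can be turned off for the whole model class, or per call via a context option:

```python
from google.appengine.ext import ndb

class Item(ndb.Model):
    # Class-level cache policy: never store Item entities in the
    # in-context cache (and, optionally, skip memcache as well).
    _use_cache = False
    _use_memcache = False
    created_at = ndb.DateTimeProperty()

# Alternatively, disable the in-context cache per call
# with a context option on the query:
records, cursor, more = Item.query().fetch_page(50, use_cache=False)
```

The class-level policy is the safer choice for a bulk export, since it covers every fetch of that kind without having to thread the option through each call.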
My calculation was slightly off, though: each generated CSV file is about 300 MB, and it is generated and saved to Google Cloud Storage within about 5 minutes. Peak memory consumption was about 480 MB.
In comparison, with an added gc.collect() in the while True: loop, as suggested by @brian in the comment above, the peak memory consumption was about 260 MB. But it took quite long, about 20 minutes.
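The shape of that paging loop can be sketched in plain Python. Here fetch_page and write_row are hypothetical stand-ins for NDB's Query.fetch_page() and the GCS file writer, so the memory behaviour can be reasoned about independently of App Engine:

```python
import gc

def export_in_batches(fetch_page, write_row, page_size=50):
    """Stream records to a writer one page at a time.

    fetch_page(page_size, cursor) -> (records, next_cursor, more)
    mirrors NDB's Query.fetch_page(); write_row is called per record.
    Returns the total number of records written.
    """
    cursor = None
    more = True
    total = 0
    while more:
        records, cursor, more = fetch_page(page_size, cursor)
        for record in records:
            write_row(record)
        total += len(records)
        # Drop the page and force a collection so its entities
        # can be reclaimed before the next fetch.
        del records
        gc.collect()
    return total
```

The explicit gc.collect() per page keeps peak memory bounded by one page of entities, at the cost of extra collection passes, which matches the slower-but-leaner 260 MB / 20 minute trade-off observed above.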