Export NDB Datastore records to Cloud Storage CSV file
In my NDB Datastore I have more than 2 million records. I want to export these records grouped by created_at
date into CSV files on Google Cloud Storage. I calculated that every file would then be about 1GB.
2014-03-18.csv, ~17000 records, ~1GB
2014-03-17.csv, ~17000 records, ~1GB
2014-03-18.csv, ~17000 records, ~1GB
...
My first approach (pseudo-code):
import cloudstorage as gcs
from datetime import timedelta

gcs_file = gcs.open('/bucket/' + str(date) + '.csv', 'w')  # cloudstorage paths start with the bucket name
query = Item.query().filter(Item.created_at >= date).filter(Item.created_at < date + timedelta(days=1))
records, cursor, more = query.fetch_page(50, start_cursor=cursor)  # fetch_page returns a (results, cursor, more) tuple
for record in records:
    gcs_file.write(record)  # in practice: serialize the entity to a CSV line first
But this (obviously?) leads to memory issues:
Error: Exceeded soft private memory limit with 622.16 MB after servicing 2 requests total
Should I use a MapReduce Pipeline instead, or is there any way to make approach 1 work? If using MapReduce: could I filter on created_at without iterating over all records in NDB?
I finally figured it out. Since all the data lives in the NDB datastore, I couldn't really test everything locally, so I found logging.info("Memory Usage: %s", runtime.memory_usage().current()) extremely helpful (import it with from google.appengine.api import runtime).
The problem is the "in-context cache": query results are written back to the in-context cache, so every fetched entity stays in memory for the lifetime of the request. The NDB documentation describes the cache and shows how to disable it for an entity kind.
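This only runs inside the App Engine runtime, so it is a sketch rather than a tested snippet, but the NDB docs describe two ways to keep query results out of the in-context cache (the Item model is the one from the question):

from google.appengine.ext import ndb

class Item(ndb.Model):
    created_at = ndb.DateTimeProperty()
    _use_cache = False     # never store entities of this kind in the in-context cache
    _use_memcache = False  # optionally skip memcache as well

# or per call, leaving the model's default policy unchanged:
records, cursor, more = Item.query().fetch_page(
    50, start_cursor=cursor, use_cache=False)

With either variant, entities from finished pages become garbage-collectable instead of accumulating in the request's cache.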
My calculation was slightly wrong, though. A generated CSV file is about 300 MB, and it is generated and saved to Google Cloud Storage within 5 minutes. Peak memory consumption was about 480 MB.
In comparison, with a gc.collect() added inside the while True: loop, as suggested by @brian in the comment above, peak memory consumption was about 260 MB. But it took quite long, about 20 minutes.