Export NDB Datastore records to Cloud Storage CSV file


Problem Description

In my NDB Datastore I have more than 2 million records. I want to export these records grouped by created_at date into CSV files on Google Cloud Storage. I calculated that every file would then be about 1GB.

2014-03-18.csv, ~17000 records, ~1GB
2014-03-17.csv, ~17000 records, ~1GB
2014-03-16.csv, ~17000 records, ~1GB
...

My first approach (pseudo-code):

import datetime
import cloudstorage as gcs

# date is a datetime.datetime at midnight; GCS paths must include the bucket
gcs_file = gcs.open('/my-bucket/%s.csv' % date.strftime('%Y-%m-%d'), 'w')
query = Item.query(Item.created_at >= date,
                   Item.created_at < date + datetime.timedelta(days=1))
# fetch_page returns a (results, cursor, more) triple, not a bare list
records, cursor, more = query.fetch_page(50, start_cursor=cursor)
for record in records:
    gcs_file.write('%s\n' % record)  # serialize each entity to a CSV line

But this (obviously?) leads to memory issues:

Error: Exceeded soft private memory limit with 622.16 MB after servicing 2 requests total


Should I use a MapReduce Pipeline instead or is there any way to make approach 1 work? If using MapReduce: Could I filter for created_at without iterating over all records in NDB?

Solution

I finally figured it out. Since all the data is in the NDB Datastore, I wasn't really able to test everything locally, so I found logging.info("Memory Usage: %s", runtime.memory_usage().current()) extremely helpful (imported with from google.appengine.api import runtime).

The problem is the "In-Context Cache": query results are written back to the in-context cache, so every entity fetched during the request stays in memory until the request ends. The fix is to disable the in-context cache for the entity kind being exported, as in the sketch below.
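A minimal sketch of disabling the cache, assuming Item is the model from the question and created_at its only relevant property:

from google.appengine.ext import ndb

class Item(ndb.Model):
    # Class variables recognized by NDB: skip the in-context cache (and,
    # optionally, memcache) so query results are not kept around for the
    # lifetime of the request.
    _use_cache = False
    _use_memcache = False

    created_at = ndb.DateTimeProperty()

Alternatively, use_cache=False can be passed as a context option to individual calls such as fetch_page(), which limits the change to the export query.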

My calculation was slightly wrong though. A generated CSV file is about 300 MB. It is generated and saved to Google Cloud Storage within 5 minutes.

Peak memory consumption was about 480 MB.

In comparison, with an added gc.collect() call in the while True: fetch loop, as suggested by @brian in the comments above, peak memory consumption was about 260 MB. But it took quite long, about 20 minutes.
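For reference, a sketch of what the final paging loop could look like under these assumptions (export_day, the bucket name, the page size, and the CSV row format are illustrative, not taken from the original answer):

import datetime
import gc
import logging

import cloudstorage as gcs
from google.appengine.api import runtime
from google.appengine.ext import ndb

def export_day(date, bucket='/my-bucket'):
    # date is a datetime.datetime at midnight of the day to export.
    gcs_file = gcs.open('%s/%s.csv' % (bucket, date.strftime('%Y-%m-%d')),
                        'w', content_type='text/csv')
    query = Item.query(Item.created_at >= date,
                       Item.created_at < date + datetime.timedelta(days=1))
    cursor = None
    while True:
        # use_cache=False keeps this page's results out of the in-context cache
        records, cursor, more = query.fetch_page(
            500, start_cursor=cursor, use_cache=False)
        for record in records:
            gcs_file.write('%s,%s\n' % (record.key.id(), record.created_at))
        gc.collect()  # release the fetched page before pulling the next one
        logging.info('Memory Usage: %s', runtime.memory_usage().current())
        if not more:
            break
    gcs_file.close()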
