Export NDB Datastore records to Cloud Storage CSV file
In my NDB Datastore I have more than 2 million records. I want to export these records grouped by created_at
date into CSV files on Google Cloud Storage. I calculated that every file would then be about 1GB.
2014-03-18.csv, ~17000 records, ~1GB
2014-03-17.csv, ~17000 records, ~1GB
2014-03-18.csv, ~17000 records, ~1GB
...
My first approach (pseudo-code):
import cloudstorage as gcs
from datetime import timedelta

gcs_file = gcs.open('/bucket/' + str(date) + '.csv', 'w')  # cloudstorage paths start with the bucket name
query = Item.query().filter(Item.created_at >= date).filter(Item.created_at < date + timedelta(days=1))
records, cursor, more = query.fetch_page(50, start_cursor=cursor)  # fetch_page returns a (results, cursor, more) tuple
for record in records:
    gcs_file.write(record)  # in practice: serialize the entity to a CSV line first
But this (obviously?) leads to memory issues:
Error: Exceeded soft private memory limit with 622.16 MB after servicing 2 requests total
Should I use a MapReduce Pipeline instead, or is there any way to make approach 1 work? If using MapReduce: could I filter on created_at without iterating over all records in NDB?
I finally figured it out. Since all the data lives in the NDB datastore, I couldn't really test everything locally, so I found logging.info("Memory Usage: %s", runtime.memory_usage().current()) extremely helpful (import it with from google.appengine.api import runtime).
The problem is the "in-context cache": query results are written back to the in-context cache, so every fetched entity stays in memory for the lifetime of the request. The NDB documentation describes the cache and shows how to disable it for an entity kind.
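This only runs inside the App Engine runtime, so it is a sketch rather than a tested snippet, but the NDB docs describe two ways to keep query results out of the in-context cache (the Item model is the one from the question):

from google.appengine.ext import ndb

class Item(ndb.Model):
    created_at = ndb.DateTimeProperty()
    _use_cache = False     # never store entities of this kind in the in-context cache
    _use_memcache = False  # optionally skip memcache as well

# or per call, leaving the model's default policy unchanged:
records, cursor, more = Item.query().fetch_page(
    50, start_cursor=cursor, use_cache=False)

With either variant, entities from finished pages become garbage-collectable instead of accumulating in the request's cache.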
My calculation was slightly wrong, though. A generated CSV file is about 300 MB, and it is generated and saved to Google Cloud Storage within 5 minutes. Peak memory consumption was about 480 MB.
In comparison, with a gc.collect() added inside the while True: loop, as suggested by @brian in the comment above, peak memory consumption was about 260 MB. But it took quite long, about 20 minutes.