Export NDB Datastore records to Cloud Storage CSV file


Problem description


In my NDB Datastore I have more than 2 million records. I want to export these records grouped by created_at date into CSV files on Google Cloud Storage. I calculated that every file would then be about 1GB.

2014-03-18.csv, ~17000 records, ~1GB
2014-03-17.csv, ~17000 records, ~1GB
2014-03-16.csv, ~17000 records, ~1GB
...

My first approach (pseudo-code):

import datetime
import cloudstorage as gcs

# GCS paths must include the bucket, e.g. '/bucket/2014-03-18.csv'
gcs_file = gcs.open('/bucket/' + str(date) + '.csv', 'w')
query = Item.query().filter(Item.created_at >= date).filter(Item.created_at < date + datetime.timedelta(days=1))
cursor = None
records, cursor, more = query.fetch_page(50, start_cursor=cursor)  # returns (results, cursor, more)
for record in records:
    gcs_file.write('%s\n' % record)  # still needs real CSV serialization
gcs_file.close()

But this (obviously?) leads to memory issues:

Error: Exceeded soft private memory limit with 622.16 MB after servicing 2 requests total


Should I use a MapReduce pipeline instead, or is there any way to make approach 1 work? If using MapReduce: could I filter on created_at without iterating over all records in NDB?

Solution

I finally figured it out. Since all the data is in the NDB datastore, I wasn't really able to test everything locally, so I found logging.info("Memory Usage: %s", runtime.memory_usage().current()) extremely helpful (import it with from google.appengine.api import runtime).

The problem was the "in-context cache": query results are written back to the in-context cache, so every entity fetched during the export stays in memory for the lifetime of the request. The fix is to disable the in-context cache for the entity kind being exported.
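As a minimal sketch of that fix (the created_at property type is an assumption, not from the original post), the cache can be disabled per kind with a class attribute, or per context with a cache policy:

from google.appengine.ext import ndb

class Item(ndb.Model):
    # Opt this kind out of the in-context cache so query results
    # are not retained for the lifetime of the request.
    _use_cache = False
    created_at = ndb.DateTimeProperty()

# Alternative: set a cache policy on the current context instead:
# ndb.get_context().set_cache_policy(lambda key: key.kind() != 'Item')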

My calculation was slightly wrong though. A generated CSV file is about 300 MB, not 1 GB, and it is generated and saved to Google Cloud Storage within 5 minutes.

Peak memory consumption was about 480MB.

In comparison, with an added gc.collect() in the while True: loop, as suggested by @brian in the comments above, the memory consumption peak was about 260MB. But it took quite long, about 20 minutes.
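For reference, a sketch of the paged export loop with both fixes applied; the bucket name, page size, and CSV row format are placeholders, and date is assumed to be a datetime at midnight:

import gc
import datetime
import cloudstorage as gcs

def export_day(date):
    # '/my-bucket' is a placeholder bucket name
    gcs_file = gcs.open('/my-bucket/' + date.isoformat() + '.csv', 'w',
                        content_type='text/csv')
    query = Item.query(Item.created_at >= date,
                       Item.created_at < date + datetime.timedelta(days=1))
    cursor, more = None, True
    while more:
        records, cursor, more = query.fetch_page(500, start_cursor=cursor)
        for record in records:
            gcs_file.write('%s\n' % record.key.id())  # placeholder CSV row
        gc.collect()  # release the finished batch before fetching the next page
    gcs_file.close()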
