Export NDB Datastore records to Cloud Storage CSV file
Question
In my NDB Datastore I have more than 2 million records. I want to export these records, grouped by created_at date, into CSV files on Google Cloud Storage. I calculated that each file would then be about 1GB.
2014-03-18.csv, ~17000 records, ~1GB
2014-03-17.csv, ~17000 records, ~1GB
2014-03-16.csv, ~17000 records, ~1GB
...
My first approach (pseudo-code):
import cloudstorage as gcs
from datetime import timedelta

# GCS paths must start with the bucket name
gcs_file = gcs.open('/bucket-name/' + date.strftime('%Y-%m-%d') + '.csv', 'w')
query = Item.query().filter(Item.created_at >= date).filter(Item.created_at < date + timedelta(days=1))
records, cursor, more = query.fetch_page(50)  # continue paging with start_cursor=cursor
for record in records:
    gcs_file.write(str(record))  # serialize the entity to a CSV row here
gcs_file.close()
But this (obviously?) leads to memory issues:
Error: Exceeded soft private memory limit with 622.16 MB after servicing 2 requests total
Should I use a MapReduce Pipeline instead, or is there any way to make approach 1 work? If using MapReduce: could I filter on created_at without iterating over all records in NDB?
Solution

I finally figured it out. Since all the data lives in the NDB datastore, I wasn't really able to test everything locally, so logging.info("Memory Usage: %s", runtime.memory_usage().current()) turned out to be extremely helpful (import with from google.appengine.api import runtime).
The problem is the "in-context cache": query results are written back to the in-context cache, so every fetched entity stays in memory for the lifetime of the request. The fix is to disable the in-context cache for the entity kind being exported.
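A minimal sketch of disabling that cache (assuming the App Engine ndb library; the Item model and created_at property are taken from the question). It can be turned off for the whole model class, or per call via a context option:

```python
from google.appengine.ext import ndb

class Item(ndb.Model):
    # Class-level cache policy: never store Item entities in the
    # in-context cache (and, optionally, skip memcache as well).
    _use_cache = False
    _use_memcache = False
    created_at = ndb.DateTimeProperty()

# Alternatively, disable the in-context cache per call
# with a context option on the query:
records, cursor, more = Item.query().fetch_page(50, use_cache=False)
```

The class-level policy is the safer choice for a bulk export, since it covers every fetch of that kind without having to thread the option through each call.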
My calculation was slightly off, though: each generated CSV file is about 300 MB, and it is generated and saved to Google Cloud Storage within about 5 minutes. Peak memory consumption was about 480 MB.
In comparison, with an added gc.collect() in the while True: loop, as suggested by @brian in the comment above, the peak memory consumption was about 260 MB. But it took quite long, about 20 minutes.
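The shape of that paging loop can be sketched in plain Python. Here fetch_page and write_row are hypothetical stand-ins for NDB's Query.fetch_page() and the GCS file writer, so the memory behaviour can be reasoned about independently of App Engine:

```python
import gc

def export_in_batches(fetch_page, write_row, page_size=50):
    """Stream records to a writer one page at a time.

    fetch_page(page_size, cursor) -> (records, next_cursor, more)
    mirrors NDB's Query.fetch_page(); write_row is called per record.
    Returns the total number of records written.
    """
    cursor = None
    more = True
    total = 0
    while more:
        records, cursor, more = fetch_page(page_size, cursor)
        for record in records:
            write_row(record)
        total += len(records)
        # Drop the page and force a collection so its entities
        # can be reclaimed before the next fetch.
        del records
        gc.collect()
    return total
```

The explicit gc.collect() per page keeps peak memory bounded by one page of entities, at the cost of extra collection passes, which matches the slower-but-leaner 260 MB / 20 minute trade-off observed above.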