How to create large number of entities in Cloud Datastore


Question


My requirement is to create a large number of entities in Google Cloud Datastore. I have CSV files, and combined, the number of entities can be around 50k. I tried the following:

1. Read a CSV file line by line and create an entity in the datastore. Issue: it works, but it times out and cannot create all the entities in one go.

2. Uploaded all files to Blobstore and read them into the datastore. Issues: I tried a Mapper function to read the CSV files uploaded to Blobstore and create entities in the datastore, but the mapper does not work if the file size grows beyond 2 MB. I also tried simply reading the files in a servlet, but again ran into the timeout issue.

I am looking for a way to create the above large number of entities (50k+) in the datastore all in one go.

Solution

Number of entities isn't the issue here (50K is relatively trivial). Finishing your request within the deadline is the issue.

It is unclear from your question where you are processing your CSVs, so I am guessing it is part of a user request - which means you have a 60-second deadline for task completion.

Task Queues

I would suggest you look into using Task Queues: when you upload a CSV that needs processing, push a task onto a queue so the file is handled in the background.
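
As a rough sketch of the upload side (assuming a webapp2 BlobstoreUploadHandler, a form field named file, a task URL of /tasks/process_csv, and a push queue named csv-import defined in queue.yaml - all illustrative names, none of them from the question):

```python
import webapp2
from google.appengine.api import taskqueue
from google.appengine.ext.webapp import blobstore_handlers


class CsvUploadHandler(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        # The uploaded CSV already lives in Blobstore; only its key is
        # passed to the background task, never the file contents.
        blob_info = self.get_uploads('file')[0]
        taskqueue.add(url='/tasks/process_csv',
                      params={'blob_key': str(blob_info.key())},
                      queue_name='csv-import')
        self.redirect('/done')
```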

When working with Task Queues, the tasks themselves still have a deadline, but one that is larger than 60 seconds (10 minutes when automatically scaled). You should read more about deadlines in the docs to make sure you understand how to handle them, including catching the DeadlineExceededError so that you can save where you are up to in the CSV and resume from that position when the task is retried (https://cloud.google.com/appengine/docs/python/taskqueue/overview-push#task_retries).
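
A hedged sketch of what such a task handler could look like, using ndb batch puts of 500 entities at a time and a checkpoint entity to record progress; the Record kind, its two fields, and the ImportCheckpoint model are placeholders I am assuming, not anything specified in the question:

```python
import csv
import itertools

import webapp2
from google.appengine.ext import blobstore, ndb
from google.appengine.runtime import DeadlineExceededError


class Record(ndb.Model):
    # Placeholder kind: swap in whatever your CSV rows actually hold.
    name = ndb.StringProperty()
    value = ndb.StringProperty()


class ImportCheckpoint(ndb.Model):
    # One checkpoint per blob key; it survives across task retries.
    rows_done = ndb.IntegerProperty(default=0)


class ProcessCsvHandler(webapp2.RequestHandler):
    def post(self):
        blob_key = self.request.get('blob_key')
        ckpt = ImportCheckpoint.get_or_insert(blob_key)
        reader = csv.reader(blobstore.BlobReader(blobstore.BlobKey(blob_key)))
        # Skip the rows a previous attempt already committed.
        rows = itertools.islice(reader, ckpt.rows_done, None)
        done = ckpt.rows_done
        try:
            while True:
                batch = list(itertools.islice(rows, 500))  # max batch put
                if not batch:
                    return  # every row committed
                # Deterministic ids keyed on row number make retried puts
                # idempotent instead of creating duplicate entities.
                ndb.put_multi([
                    Record(id='%s-%d' % (blob_key, done + i),
                           name=row[0], value=row[1])
                    for i, row in enumerate(batch)])
                done += len(batch)
        except DeadlineExceededError:
            # Persist how far we got; the failed task is retried
            # automatically and resumes from this row count.
            ckpt.rows_done = done
            ckpt.put()
            raise
```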

Caveat on catching DeadlineExceededError

Warning: The DeadlineExceededError can potentially be raised from anywhere in your program, including finally blocks, so it could leave your program in an invalid state. This can cause deadlocks or unexpected errors in threaded code (including the built-in threading library), because locks may not be released. Note that (unlike in Java) the runtime may not terminate the process, so this could cause problems for future requests to the same instance. To be safe, you should not rely on the DeadlineExceededError, and instead ensure that your requests complete well before the time limit.

If you are concerned about the above, and cannot ensure your task completes within the 10 min deadline, you have 2 options:

  1. Switch to a manually scaled instance, which gives you a 24-hour deadline.
  2. Ensure your task saves progress and returns an error well before the 10-minute deadline so that it can be resumed correctly without having to catch the error (see the sketch below).
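
Here is a sketch of option 2 under the same assumptions as the handler above: instead of catching DeadlineExceededError, the handler stops once a self-imposed time budget runs out (8 minutes here, an arbitrary safety margin under the 10-minute deadline), checkpoints, and deliberately returns an error so the queue's automatic retry picks up from the saved row count:

```python
import csv
import itertools
import time

import webapp2
from google.appengine.ext import blobstore, ndb


class Record(ndb.Model):            # placeholder kind, as before
    name = ndb.StringProperty()
    value = ndb.StringProperty()


class ImportCheckpoint(ndb.Model):  # per-blob progress marker, as before
    rows_done = ndb.IntegerProperty(default=0)


class ProcessCsvSafelyHandler(webapp2.RequestHandler):
    SOFT_BUDGET_SECS = 8 * 60  # arbitrary margin under the 10-min deadline

    def post(self):
        start = time.time()
        blob_key = self.request.get('blob_key')
        ckpt = ImportCheckpoint.get_or_insert(blob_key)
        reader = csv.reader(blobstore.BlobReader(blobstore.BlobKey(blob_key)))
        rows = itertools.islice(reader, ckpt.rows_done, None)
        done = ckpt.rows_done
        while True:
            batch = list(itertools.islice(rows, 500))
            if not batch:
                return  # finished: every row committed
            ndb.put_multi([
                Record(id='%s-%d' % (blob_key, done + i),
                       name=row[0], value=row[1])
                for i, row in enumerate(batch)])
            done += len(batch)
            if time.time() - start > self.SOFT_BUDGET_SECS:
                # Checkpoint, then fail deliberately: the queue retries
                # the task and it resumes from rows_done, so the deadline
                # error never has to be caught.
                ckpt.rows_done = done
                ckpt.put()
                self.error(500)
                return
```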
