How to upload data in bulk to the appengine datastore? Older methods do not work

Question

This should be a fairly common requirement, and a simple process: upload data in bulk to the appengine datastore.

However, none of the older solutions mentioned on stackoverflow (links below*) seem to work anymore. The bulkloader method, which was the most reasonable solution when uploading to the datastore using the DB API, doesn't work with the NDB API.

And now the bulkloader method seems to have been deprecated, and the old links, which are still present in the docs, lead to the wrong page. Here's an example:

https://developers.google.com/appengine/docs/python/tools/uploadingdata

The above link is still present on this page: https://developers.google.com/appengine/docs/python/tools/uploadinganapp

What is the recommended method for bulkloading data now?

The two feasible alternatives seem to be 1) using the remote_api or 2) writing a CSV file to a GCS bucket and reading from that. Does anybody have experience successfully using either method?
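
For reference, option (1) would presumably look something like the minimal sketch below, using remote_api_stub from the App Engine Python SDK; the hostname and the Record model are hypothetical placeholders, not an established recipe:

```python
# Minimal remote_api sketch (option 1): write entities to the live
# datastore from a local script. 'your-app-id.appspot.com' and the
# Record model are placeholders for illustration.
from google.appengine.ext import ndb
from google.appengine.ext.remote_api import remote_api_stub

class Record(ndb.Model):
    name = ndb.StringProperty()
    value = ndb.IntegerProperty()

def main():
    # Authenticate against the deployed app's remote_api endpoint.
    remote_api_stub.ConfigureRemoteApiForOAuth(
        'your-app-id.appspot.com', '/_ah/remote_api')
    # Batched puts are far faster than calling put() once per entity.
    batch = [Record(name='row-%d' % i, value=i) for i in range(500)]
    ndb.put_multi(batch)

if __name__ == '__main__':
    main()
```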

Any pointers will be greatly appreciated. Thanks!

[*The solutions offered at the links below are no longer valid]

[1] How to upload data in bulk to the Google Appengine datastore?

[2] How to insert bulk data in the Google App Engine datastore?

Answer

Some of you might be in my situation: I cannot use the datastore import/export utility, because my data needs to be transformed before going into the datastore.

I ended up using apache-beam (google cloud dataflow).

You only need to write a few lines of "beam" code (see the sketch after this list) to:

  • read your data (for example, hosted on cloud storage) - you get a PCollection of strings,
  • do whatever transform you want (so you get a PCollection of datastore Entities),
  • dump them to a datastore sink.
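
A minimal sketch of that pipeline, assuming the Beam Python SDK with the GCP extras (apache-beam[gcp]); the project id, bucket, kind, and the "name,value" CSV layout are illustrative assumptions, not from the original answer:

```python
# Read CSV lines from Cloud Storage, transform each line into a
# datastore entity, and write them through the datastore sink.
# All names below (project, bucket, kind, columns) are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1new.datastoreio import WriteToDatastore
from apache_beam.io.gcp.datastore.v1new.types import Entity, Key

PROJECT = 'your-project-id'  # placeholder

def to_entity(line):
    """Turn one 'name,value' CSV line into a datastore Entity."""
    name, value = line.split(',')
    key = Key(['MyKind', name], project=PROJECT)  # 'MyKind' is illustrative
    entity = Entity(key)
    entity.set_properties({'name': name, 'value': int(value)})
    return entity

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://your-bucket/data.csv')  # PCollection of strings
     | 'ToEntity' >> beam.Map(to_entity)                            # PCollection of entities
     | 'Write' >> WriteToDatastore(PROJECT))                        # datastore sink
```

Run it with the DataflowRunner (selected via PipelineOptions) to fan the writes out across multiple workers.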

See How to speedup bulk importing into google cloud datastore with multiple workers? for a concrete use case.

With 5 workers, I was able to write into my datastore at a speed of 800 entities per second. This enabled me to finish the importing task (16 million rows) in about 5 hours. If you want to make it faster, use more workers :D
