Import large data (json) into Firebase periodically

Question

We need to periodically update large amounts of data (approximately 5 million records) in Firebase. At the moment we have a few JSON files that are around 1 GB in size.

Existing third-party solutions (here and here) have some reliability issues (they import object by object, or require an open connection) and are quite disconnected from the Google Cloud Platform ecosystem. So I wonder if there is now an "official" way, e.g. using the new Google Cloud Functions, or a combination of App Engine / Google Cloud Storage / Google Cloud Datastore.

I would really like to avoid dealing with authentication, something that Cloud Functions seems to handle well, but I assume the function would time out (?)

  1. Is there a long-running cloud function for fetching/inserting the data? (Does that make sense?)
  2. Put the JSON files into, and read them from, somewhere inside the Google Cloud Platform?
  3. Does it make sense to put the large data into google-cloud-datastore first (i.e. is it too costly to store it in Firebase), or can the Firebase Realtime Database be used reliably as a large data store?

Answer

I am finally posting the answer, as it aligns with the new Google Cloud Platform tooling of 2017.

The newly introduced Google Cloud Functions have a limited run time of approximately 9 minutes (540 seconds). However, a Cloud Function is able to create a Node.js read stream from Cloud Storage like so (@google-cloud/storage on npm):

var gcs = require('@google-cloud/storage')({
  // You don't need extra authentication when running the function
  // online in the same project
  projectId: 'grape-spaceship-123',
  keyFilename: '/path/to/keyfile.json'
});

// Reference an existing bucket. 
var bucket = gcs.bucket('json-upload-bucket');

var remoteReadStream = bucket.file('superlarge.json').createReadStream();

Even though it is a remote stream, it is highly efficient. In tests I was able to parse JSON files larger than 3 GB in under 4 minutes while doing simple JSON transformations.

Since we are now working with Node.js streams, a library like JSONStream can efficiently transform the data on the fly (JSONStream on npm), processing the data asynchronously as if it were one large array, with the help of event streams (event-stream on npm):

var JSONStream = require('JSONStream');
var es = require('event-stream');

remoteReadStream
  .pipe(JSONStream.parse('objects.*'))
  .pipe(es.map(function (data, callback) {
    console.error(data);
    // Insert data into Firebase here.
    callback(null, data); // ! Return data if you want to make further transformations.
  }));
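
The "Insert data into Firebase here" comment above is only a placeholder. A minimal sketch of what that write could look like with the firebase-admin SDK; the databaseURL, the /objects path and the insertRecord helper are illustrative assumptions, not part of the original answer:

var admin = require('firebase-admin');

admin.initializeApp({
  credential: admin.credential.applicationDefault(),
  // Hypothetical database URL; replace it with your project's URL.
  databaseURL: 'https://grape-spaceship-123.firebaseio.com'
});

var db = admin.database();

// Write one parsed record under /objects and pass it on,
// so later pipe steps can still transform it if needed.
function insertRecord(data, callback) {
  db.ref('objects').push(data)
    .then(function () { callback(null, data); })
    .catch(callback);
}

Inside the es.map step you would then call insertRecord(data, callback) instead of calling callback directly.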

Return only null in the callback of the last step of the pipe, to prevent a memory leak from blocking the whole function.
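
Continuing the sketch above, the last stage of the pipe would then emit null instead of the record (again illustrative, not the answer's exact code):

remoteReadStream
  .pipe(JSONStream.parse('objects.*'))
  .pipe(es.map(function (data, callback) {
    insertRecord(data, function (err) {
      // Final step of the pipe: pass null instead of the record
      // so nothing piles up in memory downstream.
      callback(err, null);
    });
  }));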

If you do heavier transformations that require a longer run time, either use a "job db" in Firebase to keep track of how far you have gotten, do only a batch of, say, 100,000 transformations per invocation, and call the function again; or set up an additional function that listens for inserts into a "forimport db" and asynchronously transforms the raw JSON records into your target format and production system. This splits import and computation.
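
A rough sketch of the second variant, a function listening on a "forimport db", written against the current firebase-functions Realtime Database trigger API; the /forimport and /production paths and the transformation itself are assumptions for illustration:

var functions = require('firebase-functions');
var admin = require('firebase-admin');

admin.initializeApp();

// Assumption: the import function only dumps raw records under /forimport/{recordId}.
// This trigger then transforms each record into the production format asynchronously,
// keeping the import run short and the computation separate.
exports.transformImportedRecord = functions.database
  .ref('/forimport/{recordId}')
  .onCreate(function (snapshot, context) {
    var raw = snapshot.val();
    // Hypothetical transformation into the target format.
    var transformed = { name: raw.name, importedAt: Date.now() };
    return admin.database()
      .ref('/production/' + context.params.recordId)
      .set(transformed);
  });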

Additionally, you can run Cloud Functions code in a Node.js App Engine app, but not necessarily the other way around.
