Import large data (json) into Firebase periodically


Question

We are in the situation that we will have to update large amounts of data (roughly 5 million records) in Firebase periodically. At the moment we have a few JSON files that are around 1 GB in size.

As the existing third-party solutions (here and here) have some reliability issues (they import object by object, or need an open connection) and are quite disconnected from the Google Cloud Platform ecosystem, I wonder if there is now an "official" way, e.g. using the new Google Cloud Functions, or a combination of App Engine / Google Cloud Storage / Google Cloud Datastore.

I would really like not to deal with authentication, something that Cloud Functions seems to handle well, but I assume the function would time out (?)

  1. Is it possible (and does it make sense) to have a long-running cloud function do the data fetching / inserting?
  2. Should the JSON files be put into, and processed from, somewhere inside the Google Cloud Platform?
  3. Does it make sense to put the large data into google-cloud-datastore first (i.e. if it is too expensive to store in Firebase), or can the Firebase Realtime Database be used reliably as a large data store?

Answer

I am finally posting the answer, as it aligns with the new Google Cloud Platform tooling of 2017.

The newly introduced Google Cloud Functions have a limited run time of approximately 9 minutes (540 seconds). However, a cloud function is able to create a node.js read stream from Cloud Storage like so (@google-cloud/storage on npm):

var gcs = require('@google-cloud/storage')({
  // You don't need extra authentication when running the function
  // online in the same project
  projectId: 'grape-spaceship-123',
  keyFilename: '/path/to/keyfile.json'
});

// Reference an existing bucket. 
var bucket = gcs.bucket('json-upload-bucket');

var remoteReadStream = bucket.file('superlarge.json').createReadStream();

Even though it is a remote stream, it is highly efficient. In tests I was able to parse JSON files larger than 3 GB in under 4 minutes while doing simple JSON transformations.

As we are now working with node.js streams, any JSONStream library can efficiently transform the data on the fly (JSONStream on npm), handling the data asynchronously just like a large array with event streams (event-stream on npm).

var JSONStream = require('JSONStream')
var es = require('event-stream')

remoteReadStream.pipe(JSONStream.parse('objects.*'))
  .pipe(es.map(function (data, callback) {
    console.error(data)
    // Insert Data into Firebase.
    callback(null, data) // ! Return data if you want to make further transformations.
  }))

Return only null in the callback at the end of the pipe to prevent a memory leak from blocking the whole function.
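
As a concrete illustration of the "Insert Data into Firebase" placeholder above, a minimal sketch using the firebase-admin SDK could look like the following. The database URL and the /imports path are assumptions for illustration, and remoteReadStream is the Cloud Storage stream created earlier.

var admin = require('firebase-admin');
var JSONStream = require('JSONStream');
var es = require('event-stream');

// Assumed setup; inside Cloud Functions the default credentials are sufficient.
admin.initializeApp({
  credential: admin.credential.applicationDefault(),
  databaseURL: 'https://grape-spaceship-123.firebaseio.com' // hypothetical database URL
});

var db = admin.database();

remoteReadStream.pipe(JSONStream.parse('objects.*'))
  .pipe(es.map(function (record, callback) {
    // Write each parsed record under a hypothetical /imports node.
    db.ref('/imports').push(record)
      .then(function () {
        // End of the pipe: pass only null to avoid buffering the whole dataset.
        callback(null);
      })
      .catch(callback);
  }));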

If you do heavier transformations that require a longer run time, you can either use a "job db" in Firebase to track where you are and do e.g. only 100,000 transformations per invocation before calling the function again, or set up an additional function that listens on inserts into a "forimport db" and asynchronously transforms the raw JSON object records into your target format and production system. Split import and computation.
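
A minimal sketch of the second variant, assuming the firebase-functions SDK, a hypothetical /forimport node receiving the raw records, a hypothetical /production target node, and a hypothetical transformRecord helper:

var functions = require('firebase-functions');
var admin = require('firebase-admin');

admin.initializeApp();

// Hypothetical transformation of a raw imported record into the target format.
function transformRecord(raw) {
  return { name: raw.name, updatedAt: Date.now() };
}

// Triggered for every record written into /forimport; writes the transformed
// record into /production, decoupling import from computation.
exports.transformImport = functions.database
  .ref('/forimport/{recordId}')
  .onCreate(function (snapshot, context) {
    var transformed = transformRecord(snapshot.val());
    return admin.database()
      .ref('/production/' + context.params.recordId)
      .set(transformed);
  });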

Additionally, you can run Cloud Functions code in a node.js App Engine app, but not necessarily the other way around.
