How to export large data from Postgres to S3 using Cloud Composer?


Problem description

I have been using the Postgres to S3 operator to load data from Postgres to S3. But recently, I had to export a very large table, and my Airflow Composer task failed without any logs. This could be because we are using the NamedTemporaryFile function of Python's tempfile module to create a temporary file, and we are using this temporary file to load to S3. Since we are using Composer, the file is loaded into Composer's local memory, and since it is very large, the task fails.

Reference here: https://cloud.google.com/composer/docs/how-to/using/troubleshooting-dags#task_fails_without_sourcing_logs

I did check the RedshiftToS3 operator, since that also uses a Postgres hook, and it has several unload options that can easily handle large files, but I realised there is no 1-to-1 correspondence between Redshift and Postgres, so that is not possible. Is there any way I can split my Postgres query? Right now I'm doing SELECT * FROM TABLENAME. Also, I do not have any information about the table.
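If the main goal is just to avoid materialising the whole result set in memory at once, one option (outside the stock operator) is a server-side cursor, which streams rows from Postgres in batches. A minimal sketch, assuming psycopg2 via Airflow's PostgresHook; the connection id, cursor name, and batch size are illustrative assumptions, not values from the original operator:

```python
# Sketch: stream rows from Postgres with a server-side cursor instead of
# fetching the whole table at once. Import path may differ by Airflow version.
from airflow.hooks.postgres_hook import PostgresHook

pg_hook = PostgresHook(postgres_conn_id="my_postgres_conn")  # hypothetical conn id
conn = pg_hook.get_conn()

# Giving the cursor a name makes psycopg2 create a server-side cursor,
# so rows are fetched from Postgres in batches rather than all at once.
cursor = conn.cursor(name="export_cursor")
cursor.itersize = 10000  # rows fetched per round trip

cursor.execute("SELECT * FROM TABLENAME")
for row in cursor:
    # write each row out incrementally instead of collecting them all in memory
    pass

cursor.close()
conn.close()
```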

I also came across this similar operator: https://airflow.apache.org/docs/stable/_modules/airflow/contrib/operators/sql_to_gcs.html

There is a parameter here, approx_max_file_size_bytes:


This operator supports the ability to split large table dumps into multiple files (see notes in the filename param docs above). This param allows developers to specify the file size of the splits.

What I understood from the code is that they create a new temporary file when the size exceeds the given limit. So are they splitting the dump into multiple temp files and then uploading each one separately?
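That is roughly the pattern: after writing each row, the operator checks the temp file's size and, once it crosses approx_max_file_size_bytes, starts a new NamedTemporaryFile; each chunk is then uploaded as its own object. A simplified sketch of that splitting idea (written from memory, not the actual sql_to_gcs source):

```python
import json
from tempfile import NamedTemporaryFile


def split_cursor_into_files(cursor, approx_max_file_size_bytes):
    """Yield temp files, each kept near approx_max_file_size_bytes in size.

    Simplified illustration of the splitting idea, not the real operator code.
    """
    tmp_file = NamedTemporaryFile(delete=False)
    for row in cursor:
        tmp_file.write(json.dumps(row, default=str).encode("utf-8") + b"\n")
        # Once the current file grows past the limit, hand it off and start a new one.
        if tmp_file.tell() >= approx_max_file_size_bytes:
            tmp_file.flush()
            yield tmp_file
            tmp_file = NamedTemporaryFile(delete=False)
    tmp_file.flush()
    yield tmp_file  # the last, possibly smaller, chunk
```

Each yielded file is then uploaded separately, which is why no single dump ever has to fit in memory at once.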

EDIT:
I will again explain exactly what I'm trying to do. Currently, the Postgres to S3 operator creates a temp file and writes all the results returned by the cursor to this file, and that is causing the memory issue. So what I'm thinking is: I could add a max_file_size limit, and for each row in the cursor I would write the results to the temp file; if the size of the temp file exceeds the max_file_size limit we set, we write the contents of the file to S3, then flush or delete the file, create a new temp file, write the next row of the cursor to it, and upload that file to S3 as well. I'm not sure how to modify the operator to do that.

Recommended answer

As you've figured out already, it's because you are building up a dictionary with every row in the table; when the table has many rows, you run out of memory on the machine.

You've already answered your own question really: only write to the file until it reaches a certain size, then push the file to S3. Alternatively, you could keep the file on disk and flush the dictionary object every x rows, but in that case the file could grow very large on disk rather than in memory.
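A hedged sketch of what that loop could look like for the Postgres-to-S3 case. Import paths vary across Airflow versions, and the connection ids, bucket, key prefix, size limit, and naive CSV formatting below are illustrative assumptions, not the stock operator:

```python
import os
from tempfile import NamedTemporaryFile

from airflow.hooks.postgres_hook import PostgresHook
from airflow.hooks.S3_hook import S3Hook  # lives under the Amazon provider in newer versions


def export_table_in_chunks(postgres_conn_id, aws_conn_id, bucket, key_prefix,
                           query, max_file_size=500 * 1024 * 1024):
    """Stream query results to S3 as several files of at most ~max_file_size bytes each."""
    pg_hook = PostgresHook(postgres_conn_id=postgres_conn_id)
    s3_hook = S3Hook(aws_conn_id=aws_conn_id)

    conn = pg_hook.get_conn()
    cursor = conn.cursor(name="export_cursor")  # server-side cursor: rows are streamed
    cursor.execute(query)

    def upload(path, file_no):
        # Push one finished chunk to S3, then remove the local copy.
        s3_hook.load_file(path, key="{}_{}.csv".format(key_prefix, file_no),
                          bucket_name=bucket, replace=True)
        os.remove(path)

    file_no = 0
    tmp = NamedTemporaryFile(delete=False)
    for row in cursor:
        # Naive CSV formatting for illustration; real code should handle quoting/escaping.
        line = ",".join("" if v is None else str(v) for v in row) + "\n"
        tmp.write(line.encode("utf-8"))
        if tmp.tell() >= max_file_size:
            tmp.close()
            upload(tmp.name, file_no)
            file_no += 1
            tmp = NamedTemporaryFile(delete=False)

    tmp.close()
    upload(tmp.name, file_no)  # upload the final, partially filled file
    cursor.close()
    conn.close()
```

Wired into the operator, this loop would replace the single NamedTemporaryFile plus one-shot upload in execute(), keeping both memory and local disk usage bounded by max_file_size.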

