Join all PostgreSQL tables and make a Python dictionary


Problem Description

I need to join all PostgreSQL tables and convert them into a Python dictionary. There are 72 tables in the database. The total number of columns is greater than 1600.

I wrote a simple Python script that joins several tables, but it fails to join all of them due to a memory error: all available memory is consumed during execution. I run the script on a new virtual server with 128 GB of RAM and 8 CPUs. It fails during the lambda function execution.

How could the following code be improved so that it can join all of the tables?

from functools import reduce

from sqlalchemy import create_engine
import pandas as pd

auth = 'user:pass'
engine = create_engine('postgresql://' + auth + '@host.com:5432/db')

sql_tables = ['table0', 'table1', 'table3', ..., 'table72']

# Load every table into its own DataFrame
df_arr = [pd.read_sql_query('select * from "' + table + '"', con=engine)
          for table in sql_tables]

# Outer-join all DataFrames on USER_ID
df_join = reduce(lambda left, right: pd.merge(left, right, how='outer', on=['USER_ID']), df_arr)

# Replace NaN with a placeholder, then convert to a dictionary
raw_dict = df_join.where(pd.notnull(df_join), 'no_data').to_dict()

print(df_join)
print(raw_dict)
print(len(df_arr))

Is it OK to use Pandas for my purpose? Are there better solutions?

The ultimate goal is to denormalize DB data to be able to index it into Elasticsearch as documents, one document per user.
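Since the end goal is one document per user, one memory-friendly alternative is to stream rows table by table and fold them into a per-user dictionary, instead of building one 1600-column DataFrame first. The following is a minimal sketch of that idea with in-memory stand-in data; the function and the table-name prefixing scheme are illustrative assumptions, and in the real script each row iterable would come from a server-side database cursor:

```python
from collections import defaultdict

def accumulate(tables):
    """Merge rows from many tables into one dict per USER_ID.

    `tables` maps a table name to an iterable of row dicts; with a
    server-side cursor, only one row per table is in memory at a time.
    """
    docs = defaultdict(dict)  # USER_ID -> merged document
    for name, rows in tables.items():
        for row in rows:
            user_id = row['USER_ID']
            for col, value in row.items():
                if col != 'USER_ID':
                    # Prefix columns with the table name to avoid collisions
                    docs[user_id][name + '.' + col] = value
    return dict(docs)

# Tiny in-memory stand-in for two of the 72 tables
tables = {
    'table0': [{'USER_ID': 1, 'age': 30}, {'USER_ID': 2, 'age': 25}],
    'table1': [{'USER_ID': 1, 'city': 'Oslo'}],
}
docs = accumulate(tables)
```

Each resulting dict can then be indexed into Elasticsearch as one document, without ever materializing the full join.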

Recommended Answer

Why don't you create a Postgres function instead of a script?

Here are some suggestions that could help you avoid the memory error:


  • You can use a WITH clause, which makes better use of your memory.
  • You can create some physical tables that store the information from different groups of tables in your database. These physical tables avoid holding a great amount of data in memory. After that, all you have to do is join those physical tables. You can create a function for it.
  • You can create a data warehouse by denormalizing the tables you need.
  • Last but not least: make sure you are using indexes appropriately.
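The "physical tables" suggestion can be sketched as follows: materialize the join once inside the database with CREATE TABLE ... AS SELECT, then stream the result row by row instead of merging DataFrames in Pandas. This demo uses an in-memory SQLite database purely as a stand-in so the sketch is self-contained; against the real Postgres server, the same statements would run through SQLAlchemy or psycopg2, and the table and column names here are invented for illustration:

```python
import sqlite3

# Stand-in for the real Postgres database
con = sqlite3.connect(':memory:')
con.row_factory = sqlite3.Row
con.execute('CREATE TABLE table0 (USER_ID INTEGER, age INTEGER)')
con.execute('CREATE TABLE table1 (USER_ID INTEGER, city TEXT)')
con.executemany('INSERT INTO table0 VALUES (?, ?)', [(1, 30), (2, 25)])
con.executemany('INSERT INTO table1 VALUES (?, ?)', [(1, 'Oslo')])

# Materialize the join once, server-side, instead of merging in Pandas
con.execute("""
    CREATE TABLE joined AS
    SELECT t0.USER_ID, t0.age, t1.city
    FROM table0 AS t0
    LEFT JOIN table1 AS t1 ON t1.USER_ID = t0.USER_ID
""")

# Stream the pre-joined rows with roughly constant client memory
rows = [{k: r[k] for k in r.keys()}
        for r in con.execute('SELECT * FROM joined ORDER BY USER_ID')]
```

With the join stored as a physical table, appropriate indexes on the USER_ID columns keep the join itself cheap, and the client only ever holds the rows it is currently converting into documents.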

