Dataframe writing to Postgresql poor performance


Question

Working in PostgreSQL I have a cartesian join producing ~4 million rows. The join takes ~5 sec and the write back to the DB takes ~1 min 45 sec.

The data will be required for use in Python, specifically in a pandas dataframe, so I am experimenting with duplicating this same data in Python. I should say here that all these tests are running on one machine, so nothing is going across a network.

Using psycopg2 and pandas, reading in the data and performing the join to get the 4 million rows (from an answer here: cartesian product in pandas) consistently takes under 3 seconds, which is impressive.
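As a minimal sketch of the constant-key merge trick from that answer (the frames and column names below are made up purely for illustration; recent pandas versions also offer merge(..., how="cross")):

```python
import pandas as pd

# Stand-ins for the two tables read from the database; in the real setup
# they would come from pd.read_sql / psycopg2 rather than being built here.
df_a = pd.DataFrame({"a_id": range(2000), "a_val": range(2000)})
df_b = pd.DataFrame({"b_id": range(2000), "b_val": range(2000)})

# Cartesian join via a constant merge key: every row of df_a pairs with
# every row of df_b, giving 2000 x 2000 = 4 million rows.
df_a["_key"] = 1
df_b["_key"] = 1
result = df_a.merge(df_b, on="_key").drop(columns="_key")
```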

Writing the data back to a table in the database, however, takes anything from 8 minutes (best method) to 36+ minutes (plus some methods I rejected because I had to stop them after more than an hour).

While I was not expecting to reproduce the "SQL only" time, I would hope to get closer than 8 minutes (I'd have thought 3-5 minutes would not be unreasonable).

The slower methods:

36 min - sqlalchemy's table.insert (from 'test_sqlalchemy_core' here: https://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow)

13 min - psycopg2.extras.execute_batch (https://stackoverflow.com/a/52124686/3979391)

13-15 min (depending on chunksize) - pandas.DataFrame.to_sql (again using sqlalchemy) (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html)
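For comparison, a hedged sketch of what a to_sql call of this kind looks like (the connection string, table name, and sample frame are placeholders, not the code from the question):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder frame and connection string.
df = pd.DataFrame({"a_val": range(1000), "b_val": range(1000)})
engine = create_engine("postgresql+psycopg2://user:password@localhost/mydb")

# chunksize controls how many rows are written per batch; method="multi"
# packs multiple rows into each INSERT statement.
df.to_sql("target_table", engine, if_exists="append", index=False,
          chunksize=10_000, method="multi")
```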

The best way (~8 min) is using psycopg2's cursor.copy_from method (found here: https://github.com/blaze/odo/issues/614#issuecomment-428332541). This involves dumping the data to a CSV first (in memory via io.StringIO), which alone takes 2 minutes.
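A rough sketch in the spirit of that linked comment, with placeholder connection details and table/column names (tab-separated text is used because copy_from's default text format does not understand CSV quoting):

```python
import io

import pandas as pd
import psycopg2

df = pd.DataFrame({"a_val": range(1000), "b_val": range(1000)})  # stand-in frame
conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN

# Dump the frame to an in-memory buffer first (the ~2 min step), then
# stream it into the table with COPY.
buf = io.StringIO()
df.to_csv(buf, sep="\t", index=False, header=False)
buf.seek(0)

with conn.cursor() as cur:
    cur.copy_from(buf, "target_table", sep="\t", columns=list(df.columns))
conn.commit()
```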

So, my questions:

1) Does anyone have any potentially faster ways of writing millions of rows from a pandas dataframe to PostgreSQL?

2) The docs for the cursor.copy_from method (http://initd.org/psycopg/docs/cursor.html) state that the source object needs to support the read() and readline() methods (hence the need for io.StringIO). Presumably, if the dataframe supported those methods, we could dispense with the write to CSV. Is there some way to add these methods?
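Purely as an illustration of the idea, one shape such an adapter could take is a small file-like wrapper that renders rows on demand, so copy_from never needs a full CSV dump up front (the class, separator choice, and lack of NULL/NaN handling are all assumptions, not a tested solution):

```python
import pandas as pd
import psycopg2


class DataFrameReader:
    """Hypothetical file-like adapter: streams DataFrame rows as
    tab-separated text so psycopg2's copy_from can consume it directly."""

    def __init__(self, df, sep="\t"):
        self._rows = (sep.join(map(str, row)) + "\n"
                      for row in df.itertuples(index=False, name=None))
        self._leftover = ""

    def readline(self):
        # copy_from mainly calls read(); readline() is provided for completeness.
        return next(self._rows, "")

    def read(self, size=-1):
        chunks = [self._leftover]
        length = len(self._leftover)
        self._leftover = ""
        while size < 0 or length < size:
            row = next(self._rows, "")
            if not row:          # generator exhausted
                break
            chunks.append(row)
            length += len(row)
        data = "".join(chunks)
        if 0 <= size < len(data):
            data, self._leftover = data[:size], data[size:]
        return data              # "" signals end-of-file to copy_from


df = pd.DataFrame({"a_val": range(1000), "b_val": range(1000)})  # stand-in frame
conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
with conn.cursor() as cur:
    # NaN/None would be written as the literal strings "nan"/"None" here;
    # a real version would map them to "\\N".
    cur.copy_from(DataFrameReader(df), "target_table", sep="\t",
                  columns=list(df.columns))
conn.commit()
```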

Thanks, Giles

Answer

Answering Q1 myself: it seems the issue had more to do with PostgreSQL (or rather databases in general). Taking into account the points made in this article: https://use-the-index-luke.com/sql/dml/insert I found the following:

1) Removing all indexes from the destination table resulted in the query running in 9 seconds. Rebuilding the indexes (in PostgreSQL) took a further 12 seconds, so the total was still well under the other times. (A rough sketch combining points 1 and 2 follows this list.)

2) With only a primary key in place, inserting rows ordered by the primary key columns reduced the time taken to about a third. This makes sense, as there should be little or no shuffling of the index rows required. I also verified that this is why my cartesian join in PostgreSQL was faster in the first place (i.e. the rows happened to be ordered by the index, purely by chance); placing the same rows in a temporary table (unordered) and inserting from that actually took a lot longer.

3) I tried similar experiments on our MySQL systems and found the same increase in insert speed when removing indexes. With MySQL, however, it seemed that rebuilding the indexes used up any time gained.
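The sketch referred to under point 1, combining points 1 and 2 (the index, table, and column names are placeholders; the real schema is not shown in the question):

```python
import io

import pandas as pd
import psycopg2

df = pd.DataFrame({"id": range(1000), "a_val": range(1000)})  # stand-in frame
conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN

with conn.cursor() as cur:
    # Point 1: drop secondary indexes before the bulk load, leaving only the
    # primary key (the index name is illustrative).
    cur.execute("DROP INDEX IF EXISTS target_table_a_val_idx;")

    # Point 2: insert rows pre-sorted by the primary key column(s) so the
    # PK index is filled in order rather than shuffled.
    ordered = df.sort_values("id")
    buf = io.StringIO()
    ordered.to_csv(buf, sep="\t", index=False, header=False)
    buf.seek(0)
    cur.copy_from(buf, "target_table", sep="\t", columns=list(ordered.columns))

    # Rebuild the dropped index once the data is in.
    cur.execute("CREATE INDEX target_table_a_val_idx ON target_table (a_val);")

conn.commit()
```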

I hope this helps anyone else who comes across this question from a search.

I still wonder if it is possible to remove the write-to-CSV step in Python (Q2 above), as I believe I could then write something in Python that would be faster than pure PostgreSQL.

Thanks, Giles
