Dataframe writing to Postgresql poor performance


Question

Working in PostgreSQL I have a cartesian join producing ~4 million rows. The join takes ~5 sec and the write back to the DB takes ~1 min 45 sec.

The data will be required for use in Python, specifically in a pandas dataframe, so I am experimenting with duplicating this same data in Python. I should say here that all these tests are running on one machine, so nothing is going across a network.

Using psycopg2 and pandas, reading in the data and performing the join to get the 4 million rows (from an answer here: cartesian product in pandas) consistently takes under 3 seconds, which is impressive.
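As a minimal sketch of the constant-key merge trick from that answer (the frames and column names below are made up purely for illustration; recent pandas versions also offer merge(..., how="cross")):

```python
import pandas as pd

# Stand-ins for the two tables read from the database; in the real setup
# they would come from pd.read_sql / psycopg2 rather than being built here.
df_a = pd.DataFrame({"a_id": range(2000), "a_val": range(2000)})
df_b = pd.DataFrame({"b_id": range(2000), "b_val": range(2000)})

# Cartesian join via a constant merge key: every row of df_a pairs with
# every row of df_b, giving 2000 x 2000 = 4 million rows.
df_a["_key"] = 1
df_b["_key"] = 1
result = df_a.merge(df_b, on="_key").drop(columns="_key")
```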

Writing the data back to a table in the database, however, takes anything from 8 minutes (best method) to 36+ minutes (plus some methods I rejected because I had to stop them after more than an hour).

While I was not expecting to reproduce the "SQL only" time, I would hope to get closer than 8 minutes (I'd have thought 3-5 minutes would not be unreasonable).

The slower methods:

36 min - sqlalchemy's table.insert (from 'test_sqlalchemy_core' here: https://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow)

13 min - psycopg2.extras.execute_batch (https://stackoverflow.com/a/52124686/3979391)

13-15 min (depending on chunksize) - pandas.DataFrame.to_sql (again using sqlalchemy) (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html)
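For comparison, a hedged sketch of what a to_sql call of this kind looks like (the connection string, table name, and sample frame are placeholders, not the code from the question):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder frame and connection string.
df = pd.DataFrame({"a_val": range(1000), "b_val": range(1000)})
engine = create_engine("postgresql+psycopg2://user:password@localhost/mydb")

# chunksize controls how many rows are written per batch; method="multi"
# packs multiple rows into each INSERT statement.
df.to_sql("target_table", engine, if_exists="append", index=False,
          chunksize=10_000, method="multi")
```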

The best way (~8 min) is using psycopg2's cursor.copy_from method (found here: https://github.com/blaze/odo/issues/614#issuecomment-428332541). This involves dumping the data to a CSV first (in memory via io.StringIO), which alone takes 2 minutes.
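A rough sketch in the spirit of that linked comment, with placeholder connection details and table/column names (tab-separated text is used because copy_from's default text format does not understand CSV quoting):

```python
import io

import pandas as pd
import psycopg2

df = pd.DataFrame({"a_val": range(1000), "b_val": range(1000)})  # stand-in frame
conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN

# Dump the frame to an in-memory buffer first (the ~2 min step), then
# stream it into the table with COPY.
buf = io.StringIO()
df.to_csv(buf, sep="\t", index=False, header=False)
buf.seek(0)

with conn.cursor() as cur:
    cur.copy_from(buf, "target_table", sep="\t", columns=list(df.columns))
conn.commit()
```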

So, my questions:

1) Does anyone have any potentially faster ways of writing millions of rows from a pandas dataframe to PostgreSQL?

2) The docs for the cursor.copy_from method (http://initd.org/psycopg/docs/cursor.html) state that the source object needs to support the read() and readline() methods (hence the need for io.StringIO). Presumably, if the dataframe supported those methods, we could dispense with the write to CSV. Is there some way to add these methods?
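Purely as an illustration of the idea, one shape such an adapter could take is a small file-like wrapper that renders rows on demand, so copy_from never needs a full CSV dump up front (the class, separator choice, and lack of NULL/NaN handling are all assumptions, not a tested solution):

```python
import pandas as pd
import psycopg2


class DataFrameReader:
    """Hypothetical file-like adapter: streams DataFrame rows as
    tab-separated text so psycopg2's copy_from can consume it directly."""

    def __init__(self, df, sep="\t"):
        self._rows = (sep.join(map(str, row)) + "\n"
                      for row in df.itertuples(index=False, name=None))
        self._leftover = ""

    def readline(self):
        # copy_from mainly calls read(); readline() is provided for completeness.
        return next(self._rows, "")

    def read(self, size=-1):
        chunks = [self._leftover]
        length = len(self._leftover)
        self._leftover = ""
        while size < 0 or length < size:
            row = next(self._rows, "")
            if not row:          # generator exhausted
                break
            chunks.append(row)
            length += len(row)
        data = "".join(chunks)
        if 0 <= size < len(data):
            data, self._leftover = data[:size], data[size:]
        return data              # "" signals end-of-file to copy_from


df = pd.DataFrame({"a_val": range(1000), "b_val": range(1000)})  # stand-in frame
conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
with conn.cursor() as cur:
    # NaN/None would be written as the literal strings "nan"/"None" here;
    # a real version would map them to "\\N".
    cur.copy_from(DataFrameReader(df), "target_table", sep="\t",
                  columns=list(df.columns))
conn.commit()
```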

Thanks, Giles

Answer

Answering Q1 myself: it seems the issue had more to do with PostgreSQL (or rather databases in general). Taking into account the points made in this article: https://use-the-index-luke.com/sql/dml/insert I found the following:

1) Removing all indexes from the destination table resulted in the query running in 9 seconds. Rebuilding the indexes (in PostgreSQL) took a further 12 seconds, so the total was still well under the other times. (A rough sketch combining points 1 and 2 follows this list.)

2) With only a primary key in place, inserting rows ordered by the primary key columns reduced the time taken to about a third. This makes sense, as there should be little or no shuffling of the index rows required. I also verified that this is why my cartesian join in PostgreSQL was faster in the first place (i.e. the rows happened to be ordered by the index, purely by chance); placing the same rows in a temporary table (unordered) and inserting from that actually took a lot longer.

3) I tried similar experiments on our MySQL systems and found the same increase in insert speed when removing indexes. With MySQL, however, it seemed that rebuilding the indexes used up any time gained.
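The sketch referred to under point 1, combining points 1 and 2 (the index, table, and column names are placeholders; the real schema is not shown in the question):

```python
import io

import pandas as pd
import psycopg2

df = pd.DataFrame({"id": range(1000), "a_val": range(1000)})  # stand-in frame
conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN

with conn.cursor() as cur:
    # Point 1: drop secondary indexes before the bulk load, leaving only the
    # primary key (the index name is illustrative).
    cur.execute("DROP INDEX IF EXISTS target_table_a_val_idx;")

    # Point 2: insert rows pre-sorted by the primary key column(s) so the
    # PK index is filled in order rather than shuffled.
    ordered = df.sort_values("id")
    buf = io.StringIO()
    ordered.to_csv(buf, sep="\t", index=False, header=False)
    buf.seek(0)
    cur.copy_from(buf, "target_table", sep="\t", columns=list(ordered.columns))

    # Rebuild the dropped index once the data is in.
    cur.execute("CREATE INDEX target_table_a_val_idx ON target_table (a_val);")

conn.commit()
```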

I hope this helps anyone else who comes across this question from a search.

I still wonder if it is possible to remove the write-to-CSV step in Python (Q2 above), as I believe I could then write something in Python that would be faster than pure PostgreSQL.

Thanks, Giles
