Python-PostgreSQL psycopg2 interface --> executemany


Problem Description



I am currently analyzing a Wikipedia dump file; I am extracting a bunch of data from it using Python and persisting it into a PostgreSQL db. I am always trying to make things go faster, since this file is huge (18 GB). In order to interface with PostgreSQL, I am using psycopg2, but this module seems to mimic many other such DBAPIs.

Anyway, I have a question concerning cursor.executemany(command, values); it seems to me that running one executemany per 1000 values or so is better than calling cursor.execute(command % value) for each of these 5 million values (please confirm or correct me!).
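
For illustration, a minimal sketch of batched, parameterized inserts; the DSN, the pages table, and the (title, length) columns are hypothetical stand-ins for the asker's schema:

import psycopg2

# Hypothetical connection string and schema; adjust to taste.
conn = psycopg2.connect("dbname=wiki")
cur = conn.cursor()

# Stand-in for the ~5 million extracted tuples.
rows = [("Page_%d" % i, i % 100) for i in range(5000)]

BATCH = 1000
for i in range(0, len(rows), BATCH):
    # psycopg2 fills %s placeholders safely; building SQL with
    # command % value risks quoting bugs and SQL injection.
    cur.executemany(
        "INSERT INTO pages (title, length) VALUES (%s, %s)",
        rows[i:i + BATCH],
    )
conn.commit()

One caveat: psycopg2's executemany is essentially a loop of execute under the hood, so the speedup over per-row execute is modest; newer psycopg2 releases add psycopg2.extras.execute_values and execute_batch for genuinely batched inserts.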

But, you see, I am using an executemany to INSERT 1000 rows into a table which has a UNIQUE integrity constraint; this constraint is not verified in Python beforehand, for that would either require me to SELECT all the time (which seems counter-productive) or require more than 3 GB of RAM. All this to say that I count on Postgres to warn me when my script tries to INSERT an already existing row, which I detect by catching psycopg2.DatabaseError.

When my script detects such a non-UNIQUE INSERT, it calls connection.rollback() (which throws away up to 1000 rows every time, and rather defeats the purpose of the executemany) and then INSERTs all values one by one.
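
For reference, that fallback does not have to abandon the whole batch: a per-row SAVEPOINT lets the transaction survive a duplicate. A sketch, under the same hypothetical schema as above:

import psycopg2

conn = psycopg2.connect("dbname=wiki")  # hypothetical DSN, as above
cur = conn.cursor()

batch = [("Page_1", 10), ("Page_1", 10)]  # second row violates UNIQUE(title)

for row in batch:
    cur.execute("SAVEPOINT ins")
    try:
        cur.execute("INSERT INTO pages (title, length) VALUES (%s, %s)", row)
        cur.execute("RELEASE SAVEPOINT ins")
    except psycopg2.IntegrityError:
        # Duplicate key: undo only this row, keep the rest of the batch.
        cur.execute("ROLLBACK TO SAVEPOINT ins")
conn.commit()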

Since psycopg2 is so poorly documented (as are so many great modules...), I cannot find an efficient and effective workaround. I have reduced the number of values INSERTed per executemany from 1000 to 100 in order to reduce the likelihood of a non-UNIQUE INSERT per executemany, but I am pretty certain there is a way to just tell psycopg2 to ignore these exceptions or to tell the cursor to continue the executemany.
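
For what it's worth, modern PostgreSQL (9.5+) can be told to skip duplicates server-side with INSERT ... ON CONFLICT DO NOTHING, which removes the need to catch exceptions at all. A sketch, same hypothetical schema:

import psycopg2

conn = psycopg2.connect("dbname=wiki")  # hypothetical DSN, as above
cur = conn.cursor()

rows = [("Page_1", 10), ("Page_1", 10)]  # duplicate on purpose

# ON CONFLICT requires PostgreSQL 9.5+; (title) names the UNIQUE column.
cur.executemany(
    "INSERT INTO pages (title, length) VALUES (%s, %s) "
    "ON CONFLICT (title) DO NOTHING",
    rows,
)
conn.commit()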

Basically, this seems like the kind of problem which has a solution so easy and popular that all I can do is ask in order to learn about it.

Thanks again!

Solution

Just copy all the data into a scratch table with the psql \copy command, or use the psycopg cursor.copy_in() method. Then:

insert into mytable
select * from (
    select distinct * 
    from scratch
) uniq
where not exists (
    select 1 
    from mytable 
    where mytable.mykey = uniq.mykey
);

This will dedup and run much faster than any combination of INSERTs.

-dg
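
For completeness, a minimal psycopg2 sketch of this staging-plus-dedupe flow; in psycopg2 the COPY FROM STDIN entry point is cursor.copy_from(), and the DSN and column types here are assumptions:

import io
import psycopg2

conn = psycopg2.connect("dbname=wiki")  # hypothetical DSN
cur = conn.cursor()

# Stage the raw, possibly-duplicated rows in an unconstrained scratch table.
cur.execute("CREATE TEMP TABLE scratch (mykey text, length int)")
buf = io.StringIO("Page_1\t10\nPage_1\t10\nPage_2\t20\n")
cur.copy_from(buf, "scratch", sep="\t")

# Dedupe into the real table in a single statement, as in the answer above.
cur.execute("""
    INSERT INTO mytable
    SELECT * FROM (SELECT DISTINCT * FROM scratch) uniq
    WHERE NOT EXISTS (
        SELECT 1 FROM mytable WHERE mytable.mykey = uniq.mykey
    )
""")
conn.commit()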
