How do you make Python / PostgreSQL faster?

Question

Right now I have a log parser reading through 515 MB of plain-text files (a file for each day over the past 4 years). My code currently stands as this: http://gist.github.com/12978. I've used psyco (as seen in the code) and I'm also compiling it and using the compiled version. It's doing about 100 lines every 0.3 seconds. The machine is a standard 15" MacBook Pro (2.4 GHz C2D, 2 GB RAM).

Is it possible for this to go faster or is that a limitation on the language/database?

Recommended answer

Don't waste time profiling. The time is always in the database operations. Do as few as possible. Just the minimum number of inserts.

Three things.

One. Don't SELECT over and over again to conform the Date, Hostname and Person dimensions. Fetch all the data ONCE into a Python dictionary and use it in memory. Don't do repeated singleton selects. Use Python.
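
A minimal sketch of that idea, not the code from the gist. It uses the standard-library sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere. The `name` column and the helpers `load_people` and `person_id_for` are assumptions made for illustration. With psycopg2 the placeholder would be `%s` rather than `?`, and the new id would normally come from `INSERT ... RETURNING id` rather than `cursor.lastrowid`.

```python
import sqlite3  # stand-in for PostgreSQL so the sketch is runnable as-is


def load_people(cursor):
    """Fetch the whole people dimension ONCE into a {name: id} dictionary."""
    cursor.execute("SELECT name, id FROM people")
    return {name: person_id for name, person_id in cursor.fetchall()}


def person_id_for(cursor, cache, name):
    """Return the id for name; touch the database only on a cache miss."""
    if name not in cache:
        cursor.execute("INSERT INTO people (name) VALUES (?)", (name,))
        cache[name] = cursor.lastrowid
    return cache[name]


conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical schema: only `people` and `id` appear in the original post.
cur.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
cur.execute("INSERT INTO people (name) VALUES ('alice')")

people = load_people(cur)  # one SELECT for the whole run
ids = [person_id_for(cur, people, n) for n in ["alice", "bob", "alice", "bob"]]
conn.commit()
```

Four log lines here cost one SELECT and one INSERT in total; every repeated name is resolved from the dictionary. The same pattern would apply to the Date and Hostname dimensions.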

Two.

Specifically, do not do this. It's bad code for two reasons.

cursor.execute("UPDATE people SET chats_count = chats_count + 1 WHERE id = '%s'" % person_id)

It should be replaced with a simple SELECT COUNT(*) FROM ... . Never update to increment a count. Just count the rows that are there with a SELECT statement. [If you can't do this with a simple SELECT COUNT or SELECT COUNT(DISTINCT), you're missing some data: your data model should always provide correct, complete counts. Never update.]

And. Never build SQL using string substitution. Completely dumb.

If, for some reason, the SELECT COUNT(*) isn't fast enough (benchmark first, before doing anything lame), you can cache the result of the count in another table. AFTER all of the loads. Do a SELECT COUNT(*) FROM whatever GROUP BY whatever and insert this into a table of counts. Don't Update. Ever.
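
A rough sketch of that count cache, again with sqlite3 standing in for PostgreSQL. The `chats` and `chat_counts` tables and their columns are hypothetical names chosen for the example. The point is only that the counts are rebuilt with one INSERT ... SELECT ... GROUP BY after loading, never incremented row by row.

```python
import sqlite3  # stand-in for PostgreSQL so the sketch is runnable as-is

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical fact table and count-cache table.
cur.execute("CREATE TABLE chats (person_id INTEGER, channel TEXT)")
cur.execute(
    "CREATE TABLE chat_counts (person_id INTEGER PRIMARY KEY, chats_count INTEGER)"
)

# The load itself: plain inserts only, no per-row counter updates.
cur.executemany(
    "INSERT INTO chats (person_id, channel) VALUES (?, ?)",
    [(1, "#a"), (1, "#b"), (2, "#a")],
)

# AFTER all of the loads: rebuild the cache from scratch in one statement.
cur.execute("DELETE FROM chat_counts")
cur.execute(
    "INSERT INTO chat_counts (person_id, chats_count) "
    "SELECT person_id, COUNT(*) FROM chats GROUP BY person_id"
)
conn.commit()

counts = dict(cur.execute("SELECT person_id, chats_count FROM chat_counts"))
```

Because the cache is derived entirely from the rows in `chats`, it can be dropped and rebuilt at any time without risk of drifting away from the real data.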

Three. Use Bind Variables. Always.

cursor.execute( "INSERT INTO ... VALUES( %(x)s, %(y)s, %(z)s )", {'x':person_id, 'y':time_to_string(time), 'z':channel,} )

The SQL never changes. The values bound in change, but the SQL never changes. This is MUCH faster. Never build SQL statements dynamically. Never.
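
To illustrate, here is a small runnable sketch with sqlite3 standing in for PostgreSQL. sqlite3 spells named placeholders as `:x`, where psycopg2 uses `%(x)s` as in the statement above. The `chats` table, its columns, and the sample rows are assumptions for the example. The SQL text is a single constant and only the bound dictionaries vary.

```python
import sqlite3  # stand-in for PostgreSQL so the sketch is runnable as-is

# One constant statement; values are bound by the driver, never formatted in.
INSERT_CHAT = (
    "INSERT INTO chats (person_id, said_at, channel) VALUES (:x, :y, :z)"
)

# Hypothetical parsed log lines. Note the quote character in the second row.
rows = [
    {"x": 1, "y": "2008-09-26 10:00:00", "z": "#python"},
    {"x": 2, "y": "2008-09-26 10:00:05", "z": "#it's-fine"},
]

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE chats (person_id INTEGER, said_at TEXT, channel TEXT)")

# executemany reuses the same SQL text for every dictionary in rows.
cur.executemany(INSERT_CHAT, rows)
conn.commit()

stored = cur.execute(
    "SELECT person_id, channel FROM chats ORDER BY person_id"
).fetchall()
```

The value containing a single quote is stored intact with no escaping code, which is the correctness half of the argument against string substitution.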
