Django multiprocessing and database connections


Question

Background:

I'm working on a project which uses Django with a Postgres database. We're also using mod_wsgi, in case that matters, since some of my web searches have made mention of it. On web form submit, the Django view kicks off a job that will take a substantial amount of time (more than the user would want to wait), so we kick off the job via a system call in the background. The job that is now running needs to be able to read and write to the database. Because this job takes so long, we use multiprocessing to run parts of it in parallel.

The problem:

The top-level script has a database connection, and when it spawns child processes, it seems that the parent's connection is available to the children. Then there's an exception about how SET TRANSACTION ISOLATION LEVEL must be called before a query. Research indicated that this is due to trying to use the same database connection in multiple processes. One thread I found suggested calling connection.close() at the start of each child process so that Django will automatically create a new connection when it needs one, and therefore each child process will have a unique, unshared connection. This didn't work for me: calling connection.close() in the child process caused the parent process to complain that the connection was lost.

Other findings:

Some things I read seemed to indicate that you can't really do this, and that multiprocessing, mod_wsgi, and Django don't play well together. That just seems hard to believe, I guess.

Some suggested using Celery, which might be a long-term solution, but I am unable to get Celery installed at this time, pending some approval processes, so it's not an option right now.

Found several references on SO and elsewhere about persistent database connections, which I believe to be a different problem.

Also found references to psycopg2.pool, pgpool, and something about pgbouncer. Admittedly, I didn't understand most of what I was reading on those, but it certainly didn't jump out at me as being what I was looking for.

Current "workaround":

For now, I've reverted to just running things serially, and it works, but it's slower than I'd like.

Any suggestions as to how I can use multiprocessing to run in parallel? It seems like if the parent and the two children could all have independent connections to the database, things would be OK, but I can't seem to get that behavior.

Thanks, and sorry for the length!

Answer

Multiprocessing copies connection objects between processes because it forks processes, and therefore copies all of the parent process's file descriptors. That being said, a connection to the SQL server is just a file; on Linux you can see it under /proc/&lt;pid&gt;/fd/. Any open file is shared between forked processes. You can find more about forking here.
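As an aside, this descriptor sharing is easy to observe without Django at all. A minimal POSIX-only sketch, using a scratch file as a stand-in for the database socket: the child inherits the same open file description, so a read in the child advances the parent's file offset too.

```python
import os
import tempfile

# create a scratch file with known contents
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"abcdef")
tmp.close()

fd = os.open(tmp.name, os.O_RDONLY)
pid = os.fork()
if pid == 0:          # child: shares the open file description with the parent
    os.read(fd, 3)    # consumes "abc", advancing the shared offset
    os._exit(0)
os.waitpid(pid, 0)
rest = os.read(fd, 3)  # parent resumes at offset 3, not 0
os.close(fd)
os.unlink(tmp.name)
```

The parent reads b"def", not b"abc": the two processes were silently sharing one kernel-level file object, which is exactly what goes wrong with a shared Postgres connection.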

My solution was to simply close the DB connections just before launching the processes; each process then recreates a connection itself when it needs one (tested in Django 1.4):

from django import db
from multiprocessing import Process

def db_worker():
    some_parallel_code()  # runs with its own, freshly created connection

db.connections.close_all()  # close inherited connections before forking
Process(target=db_worker, args=()).start()
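The same idea also works with a multiprocessing.Pool by doing the cleanup in an initializer, so each worker opens its own connection. Below is only a sketch of that pattern, not the answer's code: the pid string stands in for a real per-process connection, and in a Django project you would call db.connections.close_all() inside init_worker instead.

```python
import os
from multiprocessing import get_context

_conn = None  # per-process "connection" (simulated here)

def init_worker():
    # With Django you would call db.connections.close_all() here;
    # each worker then reconnects lazily on first query. We simulate
    # the fresh connection with the worker's pid.
    global _conn
    _conn = f"conn-{os.getpid()}"

def use_conn(_):
    return _conn

# "fork" context matches the forking behaviour discussed in the answer
with get_context("fork").Pool(2, initializer=init_worker) as pool:
    conns = pool.map(use_conn, range(8))
```

Every task sees a connection belonging to its own worker process, never one inherited from the parent.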

pgbouncer/pgpool is not related to processes in the multiprocessing sense. It is rather a solution for not closing the connection on each request, i.e. for speeding up connections to Postgres under high load.

Update:

To completely remove problems with the database connection, simply move all database-related logic into db_worker. I wanted to pass a QuerySet as an argument, but a better idea is to simply pass a list of ids: see values_list('id', flat=True), and do not forget to turn it into a list with list(...) before passing it to db_worker. Thanks to that, we do not copy the models' database connection.

def db_worker(model_ids):
    # here you do Model.objects.filter(id__in=model_ids)
    obj = PartModelWorkerClass(model_ids)
    obj.run()


model_ids = Model.objects.all().values_list('id', flat=True)
model_ids = list(model_ids)  # cast to a plain list before forking
process_count = 5
delta = (len(model_ids) // process_count) + 1

# do all the db stuff here ...

# then close the db connections before forking
from django import db
db.connections.close_all()

for it in range(process_count):
    chunk = model_ids[it * delta:(it + 1) * delta]
    Process(target=db_worker, args=(chunk,)).start()
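The slicing arithmetic above can be checked in isolation. A minimal sketch, where chunk_ids is a helper name of my own (not from the answer), splitting a list of ids into at most process_count chunks:

```python
def chunk_ids(ids, process_count):
    # each worker gets at most delta ids; the last chunks may be
    # shorter or empty when len(ids) is not a multiple of delta
    delta = (len(ids) // process_count) + 1
    return [ids[i * delta:(i + 1) * delta] for i in range(process_count)]

parts = chunk_ids(list(range(10)), 3)  # delta = 4
```

Concatenating the chunks gives back the original id list, so every id is handled by exactly one worker.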

