Parallelizing pandas pyodbc SQL database calls


Question

I am currently querying data into a dataframe via the pandas.io.sql.read_sql() command. I'd like to parallelize the calls along the lines of what this talk advocates: Embarrassingly parallel database calls with Python (PyData Paris 2015).

Something (very rough) like this:

# Rough sketch; assumes psycopg2's connection pool and the ParallelConnection
# class from the talk's parallel_connection.py.
from psycopg2.pool import ThreadedConnectionPool
from parallel_connection import ParallelConnection

pools = [ThreadedConnectionPool(1, 20, dsn=d) for d in dsns]
connections = [pool.getconn() for pool in pools]
parallel_connection = ParallelConnection(connections)
pandas_cursor = parallel_connection.cursor()
pandas_cursor.execute(my_query)

Is this possible?

Answer

Yes, this should work, although with the caveat that you'll need to change parallel_connection.py from the talk you cite. In that code there is a fetchall function that executes each of the cursors in parallel and then combines the results. This is the core of what you'll change:

Old code:

def fetchall(self):
    results = [None] * len(self.cursors)
    def do_work(index, cursor):
        results[index] = cursor.fetchall()
    self._do_parallel(do_work)
    return list(chain(*[rs for rs in results]))

New code:

def fetchall(self):
    results = [None] * len(self.sql_connections)
    def do_work(index, sql_connection):
        sql, conn = sql_connection  #  Store tuple of sql/conn instead of cursor
        results[index] = pd.read_sql(sql, conn)
    self._do_parallel(do_work)
    return pd.DataFrame().append([rs for rs in results])
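Note that pd.DataFrame().append was deprecated and has been removed in pandas 2.0; on current versions the equivalent last line would be pd.concat(results), which concatenates the per-connection frames into a single DataFrame.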

Repo: https://github.com/godatadriven/ParallelConnection
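For reference, here is a minimal self-contained sketch of the same idea using only concurrent.futures and pandas.read_sql, without the ParallelConnection helper. The DSN strings and the query are placeholders, not part of the original answer:

from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import pyodbc

# Placeholder connection strings and query -- substitute your own.
dsns = ["DSN=source1", "DSN=source2"]
query = "SELECT * FROM my_table"

def fetch(dsn):
    # Each worker opens its own connection; a single pyodbc connection
    # should not be shared between threads running queries concurrently.
    conn = pyodbc.connect(dsn)
    try:
        return pd.read_sql(query, conn)
    finally:
        conn.close()

with ThreadPoolExecutor(max_workers=len(dsns)) as executor:
    frames = list(executor.map(fetch, dsns))

df = pd.concat(frames, ignore_index=True)

Each worker thread runs the query against one data source, and the per-source frames are concatenated into one DataFrame at the end, which is the same combine step that the modified fetchall performs.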

