Fastest way to read huge MySQL table in Python

Question

I was trying to read a very large MySQL table made of several million rows. I used the pandas library with its chunksize option. See the code below:

import pandas as pd
import pymysql

connection = pymysql.connect(user='xxx', password='xxx', database='xxx', host='xxx')

try:
    query = "SELECT * FROM example_table;"

    chunks = []

    # read the table in chunks of 1000 rows, then stitch them back together
    for chunk in pd.read_sql(query, connection, chunksize=1000):
        chunks.append(chunk)
    #print(len(chunks))
    result = pd.concat(chunks, ignore_index=True)
    #print(type(result))
    #print(result)

finally:
    print("Done!")

    connection.close()

The execution time is actually acceptable if I limit the number of rows selected. But if I want to select even a modest amount of data (for example, 1 million rows), then the execution time increases dramatically.

Is there perhaps a better/faster way to select the data from a relational database within Python?

Answer

Another option might be to use the multiprocessing module, dividing the query up and sending it to multiple parallel processes, then concatenating the results.

Without knowing much about pandas chunking, I think you would have to do the chunking manually (how depends on the data)... Don't use LIMIT / OFFSET though - MySQL still has to scan and throw away all the rows before each offset, so performance would be terrible.
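
For contrast, here is a quick keyset-pagination sketch (untested) that avoids OFFSET: each query seeks past the last id already read, which stays fast on an indexed column. It assumes an auto-increment integer column named id, so adjust to your schema.

import pandas as pd
import MySQLdb

connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')

chunks = []
last_id = 0
while True:
    # seek past the last id seen; the driver quotes the parameter
    chunk = pd.read_sql("SELECT * FROM example_table WHERE id > %s ORDER BY id LIMIT 100000",
                        connection, params=(last_id,))
    if chunk.empty:
        break
    chunks.append(chunk)
    last_id = int(chunk['id'].iloc[-1])

connection.close()
result = pd.concat(chunks, ignore_index=True)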

This might not be a good idea, depending on the data. It only makes sense if there is a useful way to split up the query (e.g. if it's a time series, or if there is some kind of appropriate indexed column to use). I've put two examples below to show the different cases.

Example 1

import pandas as pd
import MySQLdb
import multiprocessing

def worker(y):
    # y is a value in an indexed column, e.g. a category
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    # let the driver quote and escape the value rather than formatting it in
    df = pd.read_sql("SELECT * FROM example_table WHERE col_x = %s",
                     connection, params=(y,))
    connection.close()
    return df

p = multiprocessing.Pool(processes=10)
#(or however many processes you want to allocate)

data = p.map(worker, [y for y in col_x_categories])
#assuming col_x_categories holds a reasonable number of categories from the indexed col_x

p.close()
results = pd.concat(data)
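
Note that each worker opens its own connection on purpose: a MySQL connection can't be shared safely across processes, so every worker needs one of its own.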

Example 2

import pandas as pd
import MySQLdb
import datetime
import multiprocessing

def worker(pair):
    # pair is an (a, b) tuple of timestamps bounding one slice: a inclusive, b exclusive
    a, b = pair
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    df = pd.read_sql("SELECT * FROM example_table WHERE x >= %s AND x < %s",
                     connection, params=(a.to_pydatetime(), b.to_pydatetime()))
    connection.close()
    return df

p = multiprocessing.Pool(processes=10)
#(or however many processes you want to allocate)

# the bounds here are arbitrary and depend on knowing your data beforehand
# (i.e. d1, d2 and an appropriate freq to use); placeholder values shown
d1 = datetime.datetime(2010, 1, 1)
d2 = datetime.datetime(2020, 1, 1)
date_range = pd.date_range(start=d1, end=d2, freq="A-JAN")

date_pairs = list(zip(date_range, date_range[1:]))
data = p.map(worker, date_pairs)

p.close()
results = pd.concat(data)
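
If there's no natural category or time column, the same idea works on a numeric primary key: find the key range once, cut it into equal slices, and give each worker one slice. A rough sketch (untested), assuming an indexed integer id column:

import pandas as pd
import MySQLdb
import multiprocessing

def worker(bounds):
    # bounds is a (low, high) pair over the id column: low inclusive, high exclusive
    low, high = bounds
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    df = pd.read_sql("SELECT * FROM example_table WHERE id >= %s AND id < %s",
                     connection, params=(low, high))
    connection.close()
    return df

# find the key range once, then cut it into equal slices
connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
min_id, max_id = pd.read_sql("SELECT MIN(id), MAX(id) FROM example_table", connection).iloc[0]
connection.close()

n_slices = 10
step = int((max_id - min_id) // n_slices) + 1
bounds = [(low, low + step) for low in range(int(min_id), int(max_id) + 1, step)]

p = multiprocessing.Pool(processes=10)
data = p.map(worker, bounds)
p.close()
results = pd.concat(data, ignore_index=True)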

There are probably nicer ways of doing this (and I haven't properly tested it, etc.). I'd be interested to know how it goes if you try it.
