Fastest way to load numeric data into python/pandas/numpy array from MySQL


Problem description

I want to read some numeric (double, i.e. float64) data from a MySQL table. The size of the data is ~200k rows.

MATLAB reference:

tic;
feature accel off;
conn = database(...);
c=fetch(exec(conn,'select x,y from TABLENAME'));
cell2mat(c.data);
toc

Elapsed time is ~1 second.

Doing the same in python, using the several examples found here (I have tried them all, i.e. using pandas read_frame, frame_query and the __processCursor function): How to convert SQL Query result to PANDAS Data Structure?

Reference python code:

import pyodbc
import pandas.io.sql as psql
import pandas
connection_info = "DRIVER={MySQL ODBC 3.51 \
Driver};SERVER=;DATABASE=;USER=;PASSWORD=;OPTION=3;"
cnxn = pyodbc.connect(connection_info)
cursor = cnxn.cursor()
sql = "select x,y from TABLENAME"
#cursor.execute(sql)
#dataframe = __processCursor(cursor, dataframe=True)
#df = psql.frame_query(sql, cnxn, coerce_float=False)
df = psql.read_frame(sql, cnxn)
cnxn.close()

Takes ~6 seconds. The profiler says all the time was spent in read_frame. I was wondering if anyone could give me some hints on how this could be sped up to at least match the MATLAB code, and whether that is possible at all in python.

The bottleneck seems to be inside cursor.execute (in the pymysql library) or cursor.fetchall() (in the pyodbc library). The slowest part is reading the returned MySQL data element by element (row by row, column by column) and converting each element to the data type that the same library inferred earlier.

So far I have managed to get close to MATLAB speed with this really dirty solution:

import pymysql
import numpy

# port placeholder: the original elided it; 3306 is the MySQL default
conn = pymysql.connect(host='', port=3306, user='', passwd='', db='')
cursor = conn.cursor()
cursor.execute("select x,y from TABLENAME")
rez = cursor.fetchall()  # with the patch below, a flat list of strings
resarray = numpy.array(list(map(float, rez)))  # list() keeps this working on Python 3
finalres = resarray.reshape((resarray.size // 2, 2))  # // : integer division

The cursor.execute above IS NOT the stock pymysql execute! I have modified it, inside the file "connections.py". First, the function def _read_rowdata_packet now has, instead of:

rows.append(self._read_row_from_packet(packet))

replaced with

self._read_string_from_packet(rows,packet)

Here _read_string_from_packet is a simplified version of _read_row_from_packet with the code:

def _read_string_from_packet(self, rows, packet):
    # append every field as a raw string, skipping per-element type conversion
    for field in self.fields:
        data = packet.read_length_coded_string()
        rows.append(data)

This is an uber-dirty solution which brings the time down from 6 seconds to 2.5 seconds. I was wondering if all of this could somehow be avoided by using a different library or by passing some parameters?

Hence the solution would be to bulk-read the entire MySQL reply into a list of strings and then bulk-convert it to numeric data types, instead of doing it element by element. Does something like that already exist in python?
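
For illustration, once the reply is available as a flat list of strings (as produced by the patched reader above), numpy can do exactly this bulk conversion in one call; a minimal sketch, assuming rez holds that list:

import numpy

# numpy parses the whole list of strings to float64 in a single C-level pass
resarray = numpy.array(rez, dtype=float)
finalres = resarray.reshape((-1, 2))  # two columns: x, y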

Answer

The "problem" seems to have been the type conversion that MySQLdb, pymysql and pyodbc perform on the data, from MySQL's decimal type to python's decimal.Decimal. Changing the converters.py file in MySQLdb (at its very last lines) to have:

conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float

instead of decimal.Decimal seems to completely solve the problem, and the following code:

import MySQLdb
import numpy
import time

t = time.time()
conn = MySQLdb.connect(host='',...)
curs = conn.cursor()
curs.execute("select x,y from TABLENAME")
data = numpy.array(curs.fetchall(),dtype=float)
print(time.time()-t)

Runs in less than a second! The funny thing is that decimal.Decimal never showed up as the problem in the profiler.

A similar solution should work in the pymysql package. pyodbc is trickier: it is all written in C++, so you would have to recompile the entire package.
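
For pymysql, a hedged sketch of the same idea: the package accepts a conv mapping at connect time, so the installed source does not need to be edited (connection parameters are placeholders):

import pymysql
from pymysql.constants import FIELD_TYPE
from pymysql.converters import conversions

conv = conversions.copy()
conv[FIELD_TYPE.DECIMAL] = float      # skip decimal.Decimal, parse straight to float
conv[FIELD_TYPE.NEWDECIMAL] = float
conn = pymysql.connect(host='', user='', passwd='', db='', conv=conv)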

UPDATE

Here is a solution that does not require modifying the MySQLdb source code: Python MySQLdb returns datetime.date and decimal. The solution to then load numeric data into pandas:

import MySQLdb
import pandas.io.sql as psql
from MySQLdb.converters import conversions
from MySQLdb.constants import FIELD_TYPE

conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float
conn = MySQLdb.connect(host='',user='',passwd='',db='')
sql = "select * from NUMERICTABLE"
df = psql.read_frame(sql, conn)

Beats MATLAB by a factor of ~4 in loading the 200k x 9 table!
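
A note for newer setups: read_frame was later deprecated and removed from pandas, so on recent versions the rough equivalent is pandas.read_sql; the converter trick carries over unchanged (a sketch, not a definitive recipe for every pandas release):

import pandas
import MySQLdb
from MySQLdb.converters import conversions
from MySQLdb.constants import FIELD_TYPE

# same converter override as above: decimals arrive as plain floats
conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float
conn = MySQLdb.connect(host='', user='', passwd='', db='')
df = pandas.read_sql("select * from NUMERICTABLE", conn)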
