与R相比,将mysql表加载到python中需要花费很长时间 [英] Loading mysql table into python takes a very long time compared to R

查看:69
本文介绍了与R相比,将mysql表加载到python中需要花费很长时间的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个相当大的mysql表,大约3000万行,6列,加载到内存时大约2gb.

I have a fairly large mysql table, about 30M rows, 6 columns, about 2gb when loaded into memory.

我同时使用python和R.在R中,我可以将表加载到内存中,大约需要90秒.但是在python中需要40分钟.

I work with both python and R. In R, I can load the table into memory and it takes about 90 seconds. But in python it takes 40 minutes.

我已经尝试过使用sqlalchemy和纯pymysql.该代码非常简单,例如,使用sqlalchemy:

I've tried it both with sqlalchemy and with plain pymysql. The code is straightforward, for example, with sqlalchemy:

db_engine = sqlalchemy.create_engine("mysql+pymysql://user:pass@host/database")
cnx = db_engine.connect()
table = pd.read_sql('select * from my_table',cnx)
cnx.close()

没有sqlalchemy:

Without sqlalchemy:

cnx = mysql.connector.connect(**db_details)
cursor = cnx.cursor()
cursor.execute('select * from my_table')
table = pd.DataFrame(data=list(cursor),columns = cursor.column_names)
cnx.close()

无论哪种方式,它都比R慢很多很多,这对我来说没有多大意义.为什么会这样,有什么方法可以加快速度吗?即使是骇客也可以.

Either way, it is much, much slower than R, which doesn't make much sense to me. Why is this, and is there any way to speed it up? Even a hack would do.

要补充一点,大熊猫与它花了很长时间没有任何关系.在第二个代码段中,如果我只返回list(cursor)而不是将其放入pandas DataFrame中,则它(基本上)所需的时间就一样长.

To add, pandas doesn't have anything to do with it taking so long. In the second code snippet, if I just return list(cursor) instead of putting it into a pandas DataFrame, it takes (basically) just as long.

在编辑时: 该数据库与R/Python在同一台计算机上运行,​​因此就吞吐量而言,所有内容都应相同.

On edit: The DB is running on the same machine as R/Python, so everything should be identical in terms of throughput.

在R中,我正在使用DBI,而我使用的R代码(基本上)是这样的:

In R I am using DBI, and the R code I use is (basically) this:

require(DBI)
cnx <- dbConnect(dbDriver("MySQL"),dbname="database",username="user",password="pass",host="host")
table <- dbGetQuery(cnx,"select * from my_table")

********已解决(最多)********

******** RESOLVED (MOSTLY) ********

感谢有用的评论,尤其是@roganjosh的评论,看来问题是默认的mysql连接器是用python而不是C编写的,这使它非常慢.解决方案是使用MySQLdb,它是本机C连接器.

Thanks to helpful comments, particularly from @roganjosh, it appears that the issue is that the default mysql connector is written in python rather than C, which makes it very slow. The solution is to use MySQLdb, which is a native C connector.

在我的特定设置中,无法通过anaconda运行python 3,因为python 2仅支持MySQLdb.但是,对于python 3,有一个MySQLdb的实现,名称为mysqlclient.

In my particular setup, running python 3 with anaconda, that wasn't possible because MySQLdb is only supported in python 2. However, there is an implementation of MySQLdb for python 3 under the name mysqlclient.

使用此实现,现在读取整个表的时间减少到大约5分钟,速度不及R,但比以前花费的时间少40左右.

Using this implementation the time is now down to about 5 minutes to read the whole table, not as fast as R, but much less than the 40 or so it was taking before.

推荐答案

感谢有用的评论,尤其是@roganjosh的评论,看来问题是默认的mysql连接器是用python而不是C编写的,这使得它非常有用.慢.解决方案是使用MySQLdb,它是本机C连接器.

Thanks to helpful comments, particularly from @roganjosh, it appears that the issue is that the default mysql connector is written in python rather than C, which makes it very slow. The solution is to use MySQLdb, which is a native C connector.

在我的特定设置中,无法通过anaconda运行python 3,因为MySQLdb仅在python 2中受支持.但是,对于python 3有MySQLdb的实现,名称为mysqlclient.

In my particular setup, running python 3 with anaconda, that wasn't possible because MySQLdb is only supported in python 2. However, there is an implementation of MySQLdb for python 3 under the name mysqlclient.

使用此实现,现在读取整个表的时间减少到大约5分钟,速度不及R,但比以前需要40左右的时间要少得多.

Using this implementation the time is now down to about 5 minutes to read the whole table, not as fast as R, but much less than the 40 or so it was taking before.

我仍然对提高速度的建议持开放态度,但是我猜想这已经足够了.

I'm still open to suggestions that would make it faster, but my guess is that this is as good as it's going to get.

这篇关于与R相比,将mysql表加载到python中需要花费很长时间的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆