How to create a large pandas dataframe from an SQL query without running out of memory?

Problem description

I have trouble querying a table of more than 5 million records from a MS SQL Server database. I want to select all of the records, but my code seems to fail when selecting too much data into memory.

This works:

import pandas.io.sql as psql
sql = "SELECT TOP 1000000 * FROM MyTable" 
data = psql.read_frame(sql, cnxn)

...but this does not work:

sql = "SELECT TOP 2000000 * FROM MyTable" 
data = psql.read_frame(sql, cnxn)

It returns this error:

File "inference.pyx", line 931, in pandas.lib.to_object_array_tuples
(pandaslib.c:42733) Memory Error

I have read here that a similar problem exists when creating a dataframe from a csv file, and that the work-around is to use the 'iterator' and 'chunksize' parameters like this:

read_csv('exp4326.csv', iterator=True, chunksize=1000)
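
For illustration only (this loop is not part of the original question): with chunksize set, read_csv returns an iterator of DataFrames, which can be consumed one piece at a time:

import pandas as pd

# Each iteration yields a DataFrame with at most 1000 rows of exp4326.csv,
# so the whole file never has to sit in memory at once.
reader = pd.read_csv('exp4326.csv', iterator=True, chunksize=1000)
total_rows = sum(len(chunk) for chunk in reader)  # example: count rows chunk by chunk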

Is there a similar solution for querying from an SQL database? If not, what is the preferred work-around? Should I use some other methods to read the records in chunks? I read a bit of discussion here about working with large datasets in pandas, but it seems like a lot of work to execute a SELECT * query. Surely there is a simpler approach.

Recommended answer

UPDATE: Make sure to check out the answer below, as pandas now has built-in support for chunked loading.
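
As a minimal sketch of that built-in support (this exact call is not part of the original answer): passing a chunksize to pandas.read_sql turns the result into an iterator of DataFrames, assuming cnxn is the same database connection used in the question:

import pandas as pd

# With chunksize set, read_sql yields DataFrames of at most 10000 rows each
# instead of materialising the entire result set in one go.
chunks = pd.read_sql("SELECT * FROM MyTable", cnxn, chunksize=10000)
full_df = pd.concat(chunks, ignore_index=True)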

You could simply try to read the input table chunk-wise and assemble your full dataframe from the individual pieces afterwards, like this:

import pandas as pd
import pandas.io.sql as psql

chunk_size = 10000
offset = 0
dfs = []
while True:
    # ORDER BY has to come before LIMIT/OFFSET for the paging to be valid and deterministic
    sql = "SELECT * FROM MyTable ORDER BY ID LIMIT %d OFFSET %d" % (chunk_size, offset)
    dfs.append(psql.read_frame(sql, cnxn))
    offset += chunk_size
    if len(dfs[-1]) < chunk_size:  # last chunk was smaller than chunk_size: no more rows
        break
full_df = pd.concat(dfs)
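
Note that LIMIT/OFFSET is MySQL/PostgreSQL syntax. For the MS SQL Server database in the question, the equivalent pagination (SQL Server 2012 and later) uses OFFSET ... FETCH; a sketch of the query string that could replace the one in the loop above (the indexed ID column is an assumption):

chunk_size = 10000
offset = 0

# OFFSET ... FETCH requires an ORDER BY clause; 'ID' is assumed to be an
# indexed key column that gives the rows a stable order between queries.
sql = ("SELECT * FROM MyTable ORDER BY ID "
       "OFFSET %d ROWS FETCH NEXT %d ROWS ONLY" % (offset, chunk_size))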

It might also be possible that the whole dataframe is simply too large to fit in memory; in that case you will have no other option than to restrict the number of rows or columns you're selecting.
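
If that is the case, one hedged sketch (the column names here are purely illustrative) is to select only the columns that are actually needed and to reduce each chunk as it arrives, rather than concatenating everything:

import pandas as pd

# Pull only the columns that are really needed (illustrative names).
sql = "SELECT CustomerID, OrderTotal FROM MyTable"

# Reduce each chunk immediately and keep only the aggregated result in memory.
totals = None
for chunk in pd.read_sql(sql, cnxn, chunksize=10000):
    part = chunk.groupby("CustomerID")["OrderTotal"].sum()
    totals = part if totals is None else totals.add(part, fill_value=0)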
