Import huge data-set from SQL Server to HDF5


Problem description

I am trying to import ~12 million records with 8 columns into Python. Because of the huge size, my laptop's memory is not sufficient for this. Now I am trying to import the SQL data into an HDF5 file. It would be very helpful if someone could share a snippet of code that queries data from SQL Server and saves it to HDF5 in chunks. I am open to any other file format that would be easier to use.

I plan to do some basic exploratory analysis, and later on I might build some decision tree / linear regression models using pandas.

import pyodbc 
import numpy as np
import pandas as pd

# Connect to SQL Server using Windows authentication
con = pyodbc.connect('Trusted_Connection=yes',
                     driver='{ODBC Driver 13 for SQL Server}',
                     server='SQL_ServerName')

# Note: with chunksize set, read_sql returns an iterator of DataFrames,
# not a single DataFrame
df = pd.read_sql("select * from table_a", con,
                 index_col=['Accountid'], chunksize=1000)

Recommended answer

Try this:

# stream the query result in chunks of 100,000 rows
sql_reader = pd.read_sql("select * from table_a", con, chunksize=10**5)

hdf_fn = '/path/to/result.h5'
hdf_key = 'my_huge_df'
store = pd.HDFStore(hdf_fn)
cols_to_index = [<LIST OF COLUMNS THAT WE WANT TO INDEX in HDF5 FILE>]

# append each chunk to the same table-format store;
# defer index creation until all chunks are written
for chunk in sql_reader:
    store.append(hdf_key, chunk, data_columns=cols_to_index, index=False)

# index data columns in HDFStore
store.create_table_index(hdf_key, columns=cols_to_index, optlevel=9, kind='full')
store.close()
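
Once the file is written, the indexed data columns let you pull filtered subsets back into pandas for the exploratory analysis without loading the whole table into memory. A minimal sketch, assuming the store was built as above and that Accountid was included in cols_to_index (the filter value 12345 is made up for illustration):

import pandas as pd

hdf_fn = '/path/to/result.h5'
hdf_key = 'my_huge_df'

# pull only the rows matching a condition on an indexed data column;
# the filter is evaluated inside the HDF5 file, not in memory
subset = pd.read_hdf(hdf_fn, hdf_key, where='Accountid == 12345')

# or iterate over the stored table chunk by chunk for exploratory passes
for chunk in pd.read_hdf(hdf_fn, hdf_key, chunksize=10**5):
    print(chunk.shape)

Passing data_columns and then calling create_table_index is what makes such where queries fast; columns not listed there can still be read back, but cannot be filtered on inside the file.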
