Speed up insert to SQL Server from CSV file without using BULK INSERT or pandas to_sql


Question

I want to put a Pandas dataframe as a whole into a table in a MS SQL Server database. BULK INSERT is not allowed for common users like myself. I am using pyodbc to connect to my database, and I am using Pandas 0.13.1. I read somewhere that the to_sql method only became available in version 0.14, so it is not an option for my dataframe. Therefore I used an iterator. My dataframe has 2 columns: Col1 and Col2.

My code works and looks like this:

from pyodbc import connect
import pandas as pd

df = pd.read_csv('PathToMyCSVfile', sep=';', header=0)

cnxn = connect(DRIVER='{SQL Server}', SERVER='MyServer', DATABASE='MyDatabase')
cursor = cnxn.cursor()

for index, row in df.iterrows():
    # one INSERT and one commit per row
    cursor.execute("INSERT INTO MySchema.MyTable VALUES (?,?)", df['Col1'][index], df['Col2'][index])
    cnxn.commit()

As said, the code above works, but it is slow... What can I do to speed things up?

Answer

The bottleneck you face is that your code sends an INSERT statement for each row in the DataFrame. That is, for a sample data file

id;txt
1;alpha
2;bravo
3;charlie
4;delta
5;echo
6;foxtrot
7;golf

you would need seven (7) round-trips to the server to send the equivalent of

INSERT INTO MySchema.MyTable VALUES (1,'alpha')
INSERT INTO MySchema.MyTable VALUES (2,'bravo')
INSERT INTO MySchema.MyTable VALUES (3,'charlie')
...
INSERT INTO MySchema.MyTable VALUES (7,'golf')

You could speed that up significantly by using a Table Value Constructor to do the same thing in one round-trip:

INSERT INTO MySchema.MyTable VALUES (1,'alpha'),(2,'bravo'),(3,'charlie'), ... ,(7,'golf')
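
In pyodbc terms, that batched statement is just a longer parameterized INSERT: one (?,?) group per row, joined with commas, with the row values passed as a single flattened parameter list. A minimal sketch of the idea, assuming a cursor crsr and the #tmp (id, txt) table used further below, with three made-up rows for illustration:

rows = [(1, 'alpha'), (2, 'bravo'), (3, 'charlie')]  # illustrative batch
placeholders = ','.join('(?,?)' for _ in rows)       # '(?,?),(?,?),(?,?)'
sql = f'INSERT INTO #tmp (id, txt) VALUES {placeholders}'
params = [value for row in rows for value in row]    # flatten to [1, 'alpha', 2, 'bravo', ...]
crsr.execute(sql, params)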

The following code does just that. When I tested it using a file with 5000 rows, running it with rows_per_batch=1000 (the maximum) was about 100 times faster than with rows_per_batch=1 (the equivalent of your current approach).

import numpy
import pandas as pd
import pyodbc
import time


class MyDfInsert:
    def __init__(self, cnxn, sql_stub, data_frame, rows_per_batch=1000):
        # NB: hard limit is 1000 for SQL Server table value constructor
        self._rows_per_batch = 1000 if rows_per_batch > 1000 else rows_per_batch

        self._cnxn = cnxn
        self._sql_stub = sql_stub
        self._num_columns = None
        self._row_placeholders = None
        self._num_rows_previous = None
        self._all_placeholders = None
        self._sql = None

        row_count = 0
        param_list = list()
        for df_row in data_frame.itertuples():
            param_list.append(tuple(df_row[1:]))  # omit zero-based row index
            row_count += 1
            if row_count >= self._rows_per_batch:
                self._send_insert(param_list)  # send a full batch
                row_count = 0
                param_list = list()
        self._send_insert(param_list)  # send any remaining rows

    def _send_insert(self, param_list):
        if len(param_list) > 0:
            if self._num_columns is None:
                # print('[DEBUG] (building items that depend on the number of columns ...)')
                # this only happens once
                self._num_columns = len(param_list[0])
                self._row_placeholders = ','.join(['?' for x in range(self._num_columns)])
                # e.g. '?,?'
            num_rows = len(param_list)
            if num_rows != self._num_rows_previous:
                # print('[DEBUG] (building items that depend on the number of rows ...)')
                self._all_placeholders = '({})'.format('),('.join([self._row_placeholders for x in range(num_rows)]))
                # e.g. '(?,?),(?,?),(?,?)'
                self._sql = f'{self._sql_stub} VALUES {self._all_placeholders}'
                self._num_rows_previous = num_rows
            params = [int(element) if isinstance(element, numpy.int64) else element
                      for row_tup in param_list for element in row_tup]
            # print('[DEBUG]    sql: ' + repr(self._sql))
            # print('[DEBUG] params: ' + repr(params))
            crsr = self._cnxn.cursor()
            crsr.execute(self._sql, params)


if __name__ == '__main__':
    conn_str = (
        'DRIVER=ODBC Driver 11 for SQL Server;'
        'SERVER=192.168.1.134,49242;'
        'Trusted_Connection=yes;'
    )
    cnxn = pyodbc.connect(conn_str, autocommit=True)
    crsr = cnxn.cursor()
    crsr.execute("CREATE TABLE #tmp (id INT PRIMARY KEY, txt NVARCHAR(50))")

    df = pd.read_csv(r'C:\Users\Gord\Desktop\Query1.txt', sep=';', header=0)

    t0 = time.time()

    MyDfInsert(cnxn, "INSERT INTO #tmp (id, txt)", df, rows_per_batch=1000)

    print()
    print(f'Inserts completed in {time.time() - t0:.2f} seconds.')

    cnxn.close()
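
For reference, a small variation of the __main__ block above (placed before cnxn.close(), reusing the same cnxn, crsr, and df) can reproduce that comparison by running both batch sizes; the TRUNCATE between runs is an assumed step so the PRIMARY KEY does not reject the duplicate ids:

for batch_size in (1, 1000):
    crsr.execute("TRUNCATE TABLE #tmp")  # start each run with an empty table
    t0 = time.time()
    MyDfInsert(cnxn, "INSERT INTO #tmp (id, txt)", df, rows_per_batch=batch_size)
    print(f'rows_per_batch={batch_size}: {time.time() - t0:.2f} seconds')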

