将二进制COPY表FROM与psycopg2一起使用 [英] Use binary COPY table FROM with psycopg2

查看:141
本文介绍了将二进制COPY表FROM与psycopg2一起使用的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有数千万行要从多维数组文件传输到PostgreSQL数据库.我的工具是Python和psycopg2.批量插入数据的最有效方法是使用 copy_from .但是,我的数据主要是32位浮点数(实数或float4),所以我宁愿不从实数→文本→实数转换.这是一个示例数据库DDL:

I have tens of millions of rows to transfer from multidimensional array files into a PostgreSQL database. My tools are Python and psycopg2. The most efficient way to bulk instert data is using copy_from. However, my data are mostly 32-bit floating point numbers (real or float4), so I'd rather not convert from real → text → real. Here is an example database DDL:

CREATE TABLE num_data
(
  id serial PRIMARY KEY NOT NULL,
  node integer NOT NULL,
  ts smallint NOT NULL,
  val1 real,
  val2 double precision
);

这是我在Python中使用字符串(文本)的地方:

Here is where I'm at with Python using strings (text):

# Just one row of data
num_row = [23253, 342, -15.336734, 2494627.949375]

import psycopg2
# Python3:
from io import StringIO
# Python2, use: from cStringIO import StringIO

conn = psycopg2.connect("dbname=mydb user=postgres")
curs = conn.cursor()

# Convert floating point numbers to text, write to COPY input
cpy = StringIO()
cpy.write('\t'.join([repr(x) for x in num_row]) + '\n')

# Insert data; database converts text back to floating point numbers
cpy.seek(0)
curs.copy_from(cpy, 'num_data', columns=('node', 'ts', 'val1', 'val2'))
conn.commit()

是否存在可以使用二进制模式运行的等效项?即,将浮点数保留为二进制?这样不仅可以保持浮点精度,而且可以更快.

Is there an equivalent that could work using a binary mode? I.e., keep the floating point numbers in binary? Not only would this preserve the floating point precision, but it could be faster.

(注意:要使用与示例相同的精度,请使用SET extra_float_digits='2')

(Note: to see the same precision as the example, use SET extra_float_digits='2')

推荐答案

以下是与COPY FROM相对应的Python 3二进制代码:

Here is the binary equivalent of COPY FROM for Python 3:

from io import BytesIO
from struct import pack
import psycopg2

# Two rows of data; "id" is not in the upstream data source
# Columns: node, ts, val1, val2
data = [(23253, 342, -15.336734, 2494627.949375),
        (23256, 348, 43.23524, 2494827.949375)]

conn = psycopg2.connect("dbname=mydb user=postgres")
curs = conn.cursor()

# Determine starting value for sequence
curs.execute("SELECT nextval('num_data_id_seq')")
id_seq = curs.fetchone()[0]

# Make a binary file object for COPY FROM
cpy = BytesIO()
# 11-byte signature, no flags, no header extension
cpy.write(pack('!11sii', b'PGCOPY\n\377\r\n\0', 0, 0))

# Columns: id, node, ts, val1, val2
# Zip: (column position, format, size)
row_format = list(zip(range(-1, 4),
                      ('i', 'i', 'h', 'f', 'd'),
                      ( 4,   4,   2,   4,   8 )))
for row in data:
    # Number of columns/fields (always 5)
    cpy.write(pack('!h', 5))
    for col, fmt, size in row_format:
        value = (id_seq if col == -1 else row[col])
        cpy.write(pack('!i' + fmt, size, value))
    id_seq += 1  # manually increment sequence outside of database

# File trailer
cpy.write(pack('!h', -1))

# Copy data to database
cpy.seek(0)
curs.copy_expert("COPY num_data FROM STDIN WITH BINARY", cpy)

# Update sequence on database
curs.execute("SELECT setval('num_data_id_seq', %s, false)", (id_seq,))
conn.commit()


更新

我改写了上面的方法来为COPY编写文件.我在Python中的数据位于NumPy数组中,因此使用它们很有意义.这是data的示例,该行具有1M行,7列:


Update

I rewrote the above approach to writing the files for COPY. My data in Python is in NumPy arrays, so it makes sense to use these. Here is some example data with with 1M rows, 7 columns:

import psycopg2
import numpy as np
from struct import pack
from io import BytesIO
from datetime import datetime

conn = psycopg2.connect("dbname=mydb user=postgres")
curs = conn.cursor()

# NumPy record array
shape = (7, 2000, 500)
print('Generating data with %i rows, %i columns' % (shape[1]*shape[2], shape[0]))

dtype = ([('id', 'i4'), ('node', 'i4'), ('ts', 'i2')] +
         [('s' + str(x), 'f4') for x in range(shape[0])])
data = np.empty(shape[1]*shape[2], dtype)
data['id'] = np.arange(shape[1]*shape[2]) + 1
data['node'] = np.tile(np.arange(shape[1]) + 1, shape[2])
data['ts'] = np.repeat(np.arange(shape[2]) + 1, shape[1])
data['s0'] = np.random.rand(shape[1]*shape[2]) * 100
prv = 's0'
for nxt in data.dtype.names[4:]:
    data[nxt] = data[prv] + np.random.rand(shape[1]*shape[2]) * 10
    prv = nxt

在我的数据库上,我有两个看起来像这样的表:

On my database, I have two tables that look like:

CREATE TABLE num_data_binary
(
  id integer PRIMARY KEY,
  node integer NOT NULL,
  ts smallint NOT NULL,
  s0 real,
  s1 real,
  s2 real,
  s3 real,
  s4 real,
  s5 real,
  s6 real
) WITH (OIDS=FALSE);

和另一个名为num_data_text的表.

以下是一些简单的辅助函数,它们通过使用NumPy记录数组中的信息为COPY(文本和二进制格式)准备数据:

Here are some simple helper functions to prepare the data for COPY (both text and binary formats) by using the information in the NumPy record array:

def prepare_text(dat):
    cpy = BytesIO()
    for row in dat:
        cpy.write('\t'.join([repr(x) for x in row]) + '\n')
    return(cpy)

def prepare_binary(dat):
    pgcopy_dtype = [('num_fields','>i2')]
    for field, dtype in dat.dtype.descr:
        pgcopy_dtype += [(field + '_length', '>i4'),
                         (field, dtype.replace('<', '>'))]
    pgcopy = np.empty(dat.shape, pgcopy_dtype)
    pgcopy['num_fields'] = len(dat.dtype)
    for i in range(len(dat.dtype)):
        field = dat.dtype.names[i]
        pgcopy[field + '_length'] = dat.dtype[i].alignment
        pgcopy[field] = dat[field]
    cpy = BytesIO()
    cpy.write(pack('!11sii', b'PGCOPY\n\377\r\n\0', 0, 0))
    cpy.write(pgcopy.tostring())  # all rows
    cpy.write(pack('!h', -1))  # file trailer
    return(cpy)

这是我如何使用辅助函数对两种COPY格式方法进行基准测试的方法:

This how I'm using the helper functions to benchmark the two COPY format methods:

def time_pgcopy(dat, table, binary):
    print('Processing copy object for ' + table)
    tstart = datetime.now()
    if binary:
        cpy = prepare_binary(dat)
    else:  # text
        cpy = prepare_text(dat)
    tendw = datetime.now()
    print('Copy object prepared in ' + str(tendw - tstart) + '; ' +
          str(cpy.tell()) + ' bytes; transfering to database')
    cpy.seek(0)
    if binary:
        curs.copy_expert('COPY ' + table + ' FROM STDIN WITH BINARY', cpy)
    else:  # text
        curs.copy_from(cpy, table)
    conn.commit()
    tend = datetime.now()
    print('Database copy time: ' + str(tend - tendw))
    print('        Total time: ' + str(tend - tstart))
    return

time_pgcopy(data, 'num_data_text', binary=False)
time_pgcopy(data, 'num_data_binary', binary=True)

这是最后两个time_pgcopy命令的输出:

Here is the output from the last two time_pgcopy commands:

Processing copy object for num_data_text
Copy object prepared in 0:01:15.288695; 84355016 bytes; transfering to database
Database copy time: 0:00:37.929166
        Total time: 0:01:53.217861
Processing copy object for num_data_binary
Copy object prepared in 0:00:01.296143; 80000021 bytes; transfering to database
Database copy time: 0:00:23.325952
        Total time: 0:00:24.622095

因此,使用二进制方法时,NumPy→文件和File→数据库步骤都更快.明显的区别是Python如何准备COPY文件,这对于文本来说确实很慢.一般来说,二进制格式会以这种格式的文本格式在2/3的时间内加载到数据库中.

So both the NumPy → file and file → database steps are way faster with the binary approach. The obvious difference is how Python prepares the COPY file, which is really slow for text. Generally speaking, the binary format loads into the database in 2/3 of the time as the text format for this schema.

最后,我比较了数据库中两个表中的值,以查看数字是否不同.大约1.46%的行的s0列具有不同的值,而对于s6的这一部分增加到6.17%(可能与我使用的随机方法有关).所有70M 32位浮点值之间的非零绝对差值介于9.3132257e-010和7.6293945e-006之间.文本加载方法和二进制加载方法之间的微小差异是由于文本格式方法所需的float→text→float转换导致精度下降所致.

Lastly, I compared the values in both tables within the database to see if the numbers were different. About 1.46% of the rows have different values for column s0, and this fraction increases to 6.17% for s6 (probably related on the random method that I used). The non-zero absolute differences between all 70M 32-bit float values range between 9.3132257e-010 and 7.6293945e-006. These small differences between the text and binary loading methods are due to the loss of precision from the float → text → float conversions required for the text format method.

这篇关于将二进制COPY表FROM与psycopg2一起使用的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆