Use binary COPY table FROM with psycopg2

Views: 141. Published: 2020/5/29 19:49:44. Tags: python, postgresql, bulkinsert, binary-data, psycopg2

Problem description

I have tens of millions of rows to transfer from multidimensional array files into a PostgreSQL database. My tools are Python and psycopg2. The most efficient way to bulk insert data is copy_from. However, my data are mostly 32-bit floating point numbers (real or float4), so I'd rather not convert from real → text → real. Here is an example database DDL:

CREATE TABLE num_data
(
  id serial PRIMARY KEY NOT NULL,
  node integer NOT NULL,
  ts smallint NOT NULL,
  val1 real,
  val2 double precision
);

Here is what I have so far in Python, using strings (text):

# Just one row of data
num_row = [23253, 342, -15.336734, 2494627.949375]

import psycopg2
# Python3:
from io import StringIO
# Python2, use: from cStringIO import StringIO

conn = psycopg2.connect("dbname=mydb user=postgres")
curs = conn.cursor()

# Convert floating point numbers to text, write to COPY input
cpy = StringIO()
cpy.write('\t'.join([repr(x) for x in num_row]) + '\n')

# Insert data; database converts text back to floating point numbers
cpy.seek(0)
curs.copy_from(cpy, 'num_data', columns=('node', 'ts', 'val1', 'val2'))
conn.commit()

Is there an equivalent that works in binary mode, i.e., one that keeps the floating point numbers in binary? Not only would this preserve the floating point precision, it could also be faster.

(Note: to see the same precision as the example, use SET extra_float_digits='2')

Update

I rewrote the above approach to write the files for COPY. My data in Python are in NumPy arrays, so it makes sense to use these. Here is some example data with 1M rows and 7 value columns (plus id, node and ts):

import psycopg2
import numpy as np
from struct import pack
from io import BytesIO
from datetime import datetime

conn = psycopg2.connect("dbname=mydb user=postgres")
curs = conn.cursor()

# NumPy record array
shape = (7, 2000, 500)
print('Generating data with %i rows, %i columns' % (shape[1]*shape[2], shape[0]))

dtype = ([('id', 'i4'), ('node', 'i4'), ('ts', 'i2')] +
         [('s' + str(x), 'f4') for x in range(shape[0])])
data = np.empty(shape[1]*shape[2], dtype)
data['id'] = np.arange(shape[1]*shape[2]) + 1
data['node'] = np.tile(np.arange(shape[1]) + 1, shape[2])
data['ts'] = np.repeat(np.arange(shape[2]) + 1, shape[1])
data['s0'] = np.random.rand(shape[1]*shape[2]) * 100
prv = 's0'
for nxt in data.dtype.names[4:]:
    data[nxt] = data[prv] + np.random.rand(shape[1]*shape[2]) * 10
    prv = nxt

In my database, I have two tables that look like this:

CREATE TABLE num_data_binary
(
  id integer PRIMARY KEY,
  node integer NOT NULL,
  ts smallint NOT NULL,
  s0 real,
  s1 real,
  s2 real,
  s3 real,
  s4 real,
  s5 real,
  s6 real
) WITH (OIDS=FALSE);

and another table named num_data_text.

Here are some simple helper functions that prepare the data for COPY (both text and binary formats) using the information in the NumPy record array:

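def prepare_text(dat):
    # Encode each tab-separated line, since BytesIO only accepts bytes on Python 3.
    # Note: on NumPy >= 2.0, repr() of a scalar includes the type name
    # (e.g. np.float32(1.5)); use str(x) there instead.
    cpy = BytesIO()
    for row in dat:
        cpy.write(('\t'.join([repr(x) for x in row]) + '\n').encode('ascii'))
    return cpy

def prepare_binary(dat):
    # Each tuple: an int16 field count, then for every field an int32 byte
    # length followed by the big-endian value
    pgcopy_dtype = [('num_fields', '>i2')]
    for field, dtype in dat.dtype.descr:
        pgcopy_dtype += [(field + '_length', '>i4'),
                         (field, dtype.replace('<', '>'))]
    pgcopy = np.empty(dat.shape, pgcopy_dtype)
    pgcopy['num_fields'] = len(dat.dtype)
    for i in range(len(dat.dtype)):
        field = dat.dtype.names[i]
        # Field length in bytes (itemsize equals alignment for these types)
        pgcopy[field + '_length'] = dat.dtype[i].itemsize
        pgcopy[field] = dat[field]
    cpy = BytesIO()
    # File header: 11-byte signature, int32 flags, int32 header extension length
    cpy.write(pack('!11sii', b'PGCOPY\n\377\r\n\0', 0, 0))
    cpy.write(pgcopy.tobytes())  # all rows
    cpy.write(pack('!h', -1))  # file trailer
    return cpy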
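
To make the byte layout that prepare_binary builds more concrete, here is a minimal sketch that packs the single row from the first example with the struct module instead of NumPy. The names FIELDS and pack_row are hypothetical helpers introduced only for this illustration, and the column types are assumed to match the num_data table (integer, smallint, real, double precision).

```python
import struct
from io import BytesIO

# One row for the columns (node integer, ts smallint, val1 real, val2 double precision)
num_row = (23253, 342, -15.336734, 2494627.949375)

# Hypothetical per-column table: (struct format in network byte order, byte length)
FIELDS = [('!i', 4), ('!h', 2), ('!f', 4), ('!d', 8)]

def pack_row(row):
    """Pack one tuple: int16 field count, then int32 length + data per field."""
    out = [struct.pack('!h', len(row))]
    for value, (fmt, size) in zip(row, FIELDS):
        if value is None:
            out.append(struct.pack('!i', -1))  # NULL: length -1 and no data bytes
        else:
            out.append(struct.pack('!i', size) + struct.pack(fmt, value))
    return b''.join(out)

buf = BytesIO()
buf.write(struct.pack('!11sii', b'PGCOPY\n\377\r\n\0', 0, 0))  # 19-byte file header
buf.write(pack_row(num_row))                                    # one tuple
buf.write(struct.pack('!h', -1))                                # file trailer
payload = buf.getvalue()
print(len(payload), 'bytes')
```

After buf.seek(0), a buffer like this could be passed to curs.copy_expert with a statement such as 'COPY num_data (node, ts, val1, val2) FROM STDIN WITH BINARY'. Packing row by row in pure Python is likely much slower than the NumPy approach for millions of rows; the sketch is only meant to show the format.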

This is how I'm using the helper functions to benchmark the two COPY format methods:

def time_pgcopy(dat, table, binary):
    print('Processing copy object for ' + table)
    tstart = datetime.now()
    if binary:
        cpy = prepare_binary(dat)
    else:  # text
        cpy = prepare_text(dat)
    tendw = datetime.now()
    print('Copy object prepared in ' + str(tendw - tstart) + '; ' +
          str(cpy.tell()) + ' bytes; transfering to database')
    cpy.seek(0)
    if binary:
        curs.copy_expert('COPY ' + table + ' FROM STDIN WITH BINARY', cpy)
    else:  # text
        curs.copy_from(cpy, table)
    conn.commit()
    tend = datetime.now()
    print('Database copy time: ' + str(tend - tendw))
    print('        Total time: ' + str(tend - tstart))
    return

time_pgcopy(data, 'num_data_text', binary=False)
time_pgcopy(data, 'num_data_binary', binary=True)

Here is the output from the last two time_pgcopy commands:

Processing copy object for num_data_text
Copy object prepared in 0:01:15.288695; 84355016 bytes; transfering to database
Database copy time: 0:00:37.929166
        Total time: 0:01:53.217861
Processing copy object for num_data_binary
Copy object prepared in 0:00:01.296143; 80000021 bytes; transfering to database
Database copy time: 0:00:23.325952
        Total time: 0:00:24.622095

So both the NumPy → file and file → database steps are much faster with the binary approach. The most obvious difference is in how Python prepares the COPY file, which is really slow for text. Generally speaking, for this schema the binary format loads into the database in about 2/3 of the time the text format takes.
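
As a quick check of those ratios, the following sketch recomputes them from the timings and byte counts printed in the benchmark output above. The variable names are made up for this illustration; only the numbers come from the output.

```python
from datetime import timedelta

# Timings copied from the benchmark output above
text_prepare = timedelta(minutes=1, seconds=15.288695)
text_copy = timedelta(seconds=37.929166)
binary_prepare = timedelta(seconds=1.296143)
binary_copy = timedelta(seconds=23.325952)

# Dividing one timedelta by another gives a plain float in Python 3
copy_ratio = binary_copy / text_copy
prepare_speedup = text_prepare / binary_prepare
total_speedup = (text_prepare + text_copy) / (binary_prepare + binary_copy)

# Sizes of the two COPY payloads in bytes, also from the output
size_ratio = 80000021 / 84355016

print('database copy time, binary/text: %.2f' % copy_ratio)
print('preparation speedup, text/binary: %.1fx' % prepare_speedup)
print('total speedup, text/binary: %.1fx' % total_speedup)
print('payload size, binary/text: %.2f' % size_ratio)
```

The payloads are nearly the same size, so the faster database step with binary is probably not explained by transferring fewer bytes.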

Lastly, I compared the values in both tables within the database to see if the numbers were different. About 1.46% of the rows have different values for column s0, and this fraction increases to 6.17% for s6 (probably related to the random method that I used). The non-zero absolute differences between all 70M 32-bit float values range between 9.3132257e-010 and 7.6293945e-006. These small differences between the text and binary loading methods are due to the loss of precision from the float → text → float conversions required for the text format method.
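
The effect can be reproduced without a database. The sketch below is an illustration under assumptions, not the exact formatting that repr or PostgreSQL applied in the test above: it formats random float32 values with a fixed number of significant digits, parses the text back, and counts how many values changed. The helper count_changed is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
values = (rng.random(100_000) * 100).astype(np.float32)

def count_changed(vals, digits):
    """Count float32 values that differ after a float -> text -> float round trip."""
    fmt = '%.' + str(digits) + 'g'
    text = [fmt % float(v) for v in vals]  # float32 -> decimal text
    back = np.array([float(s) for s in text], dtype=np.float32)  # text -> float32
    return int(np.count_nonzero(back != vals))

changed_6 = count_changed(values, 6)
changed_9 = count_changed(values, 9)
print('changed with 6 significant digits:', changed_6)
print('changed with 9 significant digits:', changed_9)
```

Nine significant decimal digits are enough to round-trip any float32 value, while fewer digits can change the stored bits. The binary format avoids the question entirely because the bytes are never converted to text.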
