psycopg2 equivalent to R dbWriteTable command and getting more performance from python code

Question

I just did a time test on loading a data array from a csv, creating a database on Postgres and writing the table to it in python and R.

I was surprised that the times were very similar:

The Python code first (as an example):

import timeit
tic = timeit.default_timer()
tic4 = timeit.default_timer()
import xlrd as xl
import psycopg2 as pq
import os
import pandas as pd 
import numpy as np
import csv
from pprint import pprint as pp

perf_dir = '/myhomedir'
toc4=timeit.default_timer()

# Create the database
tic1= timeit.default_timer()
os.system('dropdb ptest')
os.system('createdb ptest')
# connect to the database
cn = pq.connect('dbname=ptest user=me')
cr = cn.cursor()
toc1=timeit.default_timer()

# Create the tables:
### load csvs
tic2=timeit.default_timer()
id_files = ('di1','di2','di.c')
id_files = [i+r'.csv' for i in id_files]
id1 = csv.reader(open(os.path.join(perf_dir,id_files[0])),delimiter='\t')
id1 = [i for i in id1]

id2 = csv.reader(open(os.path.join(perf_dir,id_files[1])))
id2 = [i for i in id2]

id3 = csv.reader(open(os.path.join(perf_dir,id_files[2])),delimiter='\t')
id3 = [i for i in id3]
id3 = [i[1:3] for i in id3]
toc2=timeit.default_timer()

# create tables and fill
### id1 fund classifications
tic3=timeit.default_timer()
cr.execute('CREATE TABLE id1 (%s varchar, %s int PRIMARY KEY, %s int, %s int, %s varchar)' % tuple(id1[0]))
FLDS = 'INSERT INTO id1 (%s,%s,%s,%s,%s) VALUES' % tuple(id1[0]) 
SQL = FLDS + ' (%s,%s,%s,%s,%s)'
for i in range(1,len(id1)):
    data = tuple(id1[i])
    cr.execute(SQL,data)

### id2 portfolio group classifications - reference only
cr.execute('CREATE TABLE id2 (%s varchar, %s int PRIMARY KEY, %s int)' % tuple(id2[0]))
SQL = 'INSERT INTO id2 (%s,%s,%s) VALUES' % tuple(id2[0]) + ' (%s,%s,%s)'
for i in range(1,len(id2)):
    data = tuple(id2[i])
    cr.execute(SQL,data)

### id3 value variable classifications
cr.execute('CREATE TABLE id3 (%s varchar,%s varchar)' % tuple(id3[0]))
SQL = 'INSERT INTO id3 VALUES(%s,%s)'
for i in range(1,len(id3)):
    data = tuple(id3[i])
    cr.execute(SQL,data)

cn.commit()

# Timing block - will be commented out in final code
toc3=timeit.default_timer()
toc = timeit.default_timer()
time = (toc - tic)
time1 = toc1 - tic1
time2 = toc2 - tic2
time3 = toc3 - tic3
time4 = toc4 - tic4
print('Overall time: %s' % time)
print('dB create & connect time: %s' % time1)
print('Load id csvs time: %s' % time2)
print('Create tables and write to db time: %s' % time3)
print('Time to import libraries: %s' % time4)

And the R code:

tic = proc.time()
library(RPostgreSQL)

tic1 = proc.time()
system('dropdb ptest1')
system('createdb ptest1')
drv = dbDriver("PostgreSQL")
con = dbConnect(drv, dbname='ptest1')
toc1 = proc.time()
time1 = toc1 - tic1

tic2 = proc.time()
id.1 = read.csv('/myhomedir/di1.csv',stringsAsFactors=F,sep='\t')
id.2 = read.csv('/myhomedir/di2.csv',stringsAsFactors=F)
id.3 = read.csv('/myhomedir/di.c.csv',stringsAsFactors=F,sep='\t')
id.3 = id.3[,-1]
toc2 = proc.time()
time2 = toc2 - tic2

tic3 = proc.time()
dbWriteTable(con,'id1',id.1)
dbWriteTable(con,'id2',id.2)
dbWriteTable(con,'id3',id.3)
toc3 = proc.time()
time3 = toc3 - tic3

toc = proc.time()
time = toc - tic
tyme = rbind(time1,time2,time3,time)
tyme = data.frame(Function=c('Create & Connect to DB',"Load CSV's for save","Write Table to DB",'Overall Time'),tyme)

I was very surprised at how close the times for the two were. (I've read an awful lot on R being slow and Python being extremely fast)

For python:

>>> Overall time: 2.48381304741
dB create & connect time: 1.96832108498
Load id csvs time: 0.000378847122192
Create tables and write to db time: 0.35303401947
Time to import libraries: 0.162075042725

And for R:

                    Function user.self sys.self elapsed user.child sys.child
time1 Create & Connect to DB     0.112    0.016   1.943       0.06     0.004
time2    Load CSV's for save     0.008    0.000   0.006       0.00     0.000
time3      Write Table to DB     0.096    0.004   0.349       0.00     0.000
time            Overall Time     0.376    0.028   2.463       0.06     0.004

I wondered if it had something to do with the fact that I'm INSERTing a row at a time into the python version of the table.

Hence the main question - is there an equivalent in python for the dbWriteTable block in the R code, and would it speed things up?

A second ancillary question: is there anything obviously wrong with the code that might be slowing things down?

Happy to provide sample csv's if that would help.

Not looking to start a flame war on R v Python, would just like to know how I can make my code faster.

Thanks

Answer

Try this:

    # read the csv and pull the header row off separately
    id2 = csv.reader(open(os.path.join(perf_dir,id_files[1])))
    h = tuple(next(id2))

    # build the CREATE and INSERT statements from the header names
    create = '''CREATE TABLE id2 (%s varchar, %s int PRIMARY KEY, %s int)''' % h
    insert = '''INSERT INTO id2 (%s, %s, %s) VALUES (%%s, %%s, %%s)''' % h

    ...

    # hand all remaining rows to the driver in a single call
    cr.executemany(insert, id2)

Pass the sql command only once for all the rows. In addition to the performance overhead of issuing additional sql commands, notice that the size of the sql string itself (in bytes) likely dwarfs the size of the real data under test.
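
Putting the pieces together, a complete version of that snippet might look like the following (a minimal sketch, reusing perf_dir, the connection settings, and the di2.csv layout from the question; the ... above elides the surrounding setup):

    import csv
    import os
    import psycopg2 as pq

    perf_dir = '/myhomedir'
    cn = pq.connect('dbname=ptest user=me')
    cr = cn.cursor()

    # the header row supplies the column names
    id2 = csv.reader(open(os.path.join(perf_dir, 'di2.csv')))
    h = tuple(next(id2))

    cr.execute('CREATE TABLE id2 (%s varchar, %s int PRIMARY KEY, %s int)' % h)
    insert = 'INSERT INTO id2 (%s, %s, %s) VALUES (%%s, %%s, %%s)' % h

    # one executemany call; the reader streams the remaining rows
    cr.executemany(insert, id2)
    cn.commit()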

Using id2 as a generator without the array should also avoid loading the entire data set into memory.
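
As for the closest thing to dbWriteTable itself: RPostgreSQL bulk-loads the data frame via Postgres COPY under the hood, and psycopg2 exposes the same mechanism as cursor.copy_from. A minimal sketch, assuming the id1 table has already been created as in the question and that di1.csv is tab-delimited with a single header row:

    import os
    import psycopg2 as pq

    cn = pq.connect('dbname=ptest user=me')
    cr = cn.cursor()

    with open(os.path.join('/myhomedir', 'di1.csv')) as f:
        next(f)  # skip the header line; COPY expects data rows only
        # stream the file into the existing id1 table in one COPY operation
        cr.copy_from(f, 'id1', sep='\t')
    cn.commit()

For bulk loads COPY generally beats both row-at-a-time INSERTs and executemany, since the data travels to the server as a single stream.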
