Psycopg2: copy CSV data to table with extra column values
Question
I am using the copy_expert method in psycopg2 to copy data from a CSV file to a PostgreSQL table. I have a table like this:
create table my_table (
    cola text,
    colb text,
    colc text,
    cold text,
    cole text,
    colf text,
    colg text
)
And a CSV containing data for the first five columns, like this:
cola,colb,colc,cold,cole
1,foo,a,10,vvv
2,bar,b,20,www
3,baz,c,30,xxx
4,boo,d,40,yyy
5,baa,e,50,zzz
I would like to copy the CSV data for the first five columns, while also specifying values for colf and colg (each row should have the same value for colf and colg).
I can copy the first five columns to my table like this:
conn = psycopg2.connect('dbname=name user=username')
cur = conn.cursor()
copy_sql = """
copy my_table (cola, colb, colc, cold, cole)
from stdin with
csv
header
delimiter as ','
"""
from_csv = '/path/to/data.csv'
with open(from_csv, 'r') as f:
    cur.copy_expert(sql=copy_sql, file=f)
conn.commit()
cur.close()
How can I also specify values for the last two columns using Python? I know that I can specify default values in the table DDL, like this:
create table my_table (
    cola text,
    colb text,
    colc text,
    cold text,
    cole text,
    colf text default 'foo',
    colg text default 'bar'
)
But I'd like to add the values using Python, since each CSV upload will have its own values for colf and colg, and these values are determined by logic in my Python code.
Answer
It looks like there are a couple of ways to do this: first add the columns I need to the data, then upload the updated data.
Using the petl package:
import psycopg2
from petl import fromcsv, addfield, todb
csv_file = '/path/to/data.csv'
table = fromcsv(csv_file)
table = addfield(table, 'colf', 'Some value')
table = addfield(table, 'colg', 'Another value')
conn = psycopg2.connect('dbname=test user=user')
todb(table, conn, 'my_table')
This works okay on small data, but it's incredibly slow on large data. The psycopg2 copy_from and copy_expert commands are much faster, since they use PostgreSQL's bulk copy. I was able to copy my data using copy_from by first converting my CSV file into a pandas dataframe:
import psycopg2
import pandas as pd
from io import StringIO

csv_file = '/path/to/file'
df = pd.read_csv(csv_file)
df['colf'] = 'My value'
df['colg'] = 'Foobar'
In order to use the psycopg2 copy_ commands, I need to turn the dataframe into a file-like object with read() and readline() methods, which I can do with StringIO:
buf = StringIO()
df.to_csv(buf, header=False, index=False)
buf.seek(0)
Note that you need to rewind the buffer to the start, because pandas.to_csv leaves the stream position at the end after writing. See this SO answer for an explanation.
Then I can copy that buffer object:
conn = psycopg2.connect('dbname=test user=user')
cur = conn.cursor()
cur.copy_from(buf, 'my_table', sep=',')
conn.commit()
cur.close()
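For large files, the pandas detour can also be skipped entirely. A minimal sketch of the same idea using only the standard-library csv module (the helper name `augment_csv` is my own, not from the original answer): stream the source CSV, append the two constant columns to every data row, and hand the resulting buffer to copy_expert.

```python
import csv
import io

def augment_csv(src, colf_value, colg_value):
    """Read CSV text from the file-like `src`, append two constant
    columns to every data row, and return a rewound in-memory buffer."""
    buf = io.StringIO()
    reader = csv.reader(src)
    writer = csv.writer(buf, lineterminator="\n")
    next(reader)  # skip the original header row
    for row in reader:
        writer.writerow(row + [colf_value, colg_value])
    buf.seek(0)  # rewind so copy_expert reads from the top
    return buf

# Example with the sample data from the question:
sample = io.StringIO("cola,colb,colc,cold,cole\n1,foo,a,10,vvv\n")
buf = augment_csv(sample, "My value", "Foobar")
# buf now yields: 1,foo,a,10,vvv,My value,Foobar
```

The buffer can then be loaded with `cur.copy_expert("copy my_table from stdin with csv", buf)`, avoiding the pandas dependency and keeping memory use proportional to the output size.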