Psycopg2: copy CSV data to table with extra column values

Problem description

I am using the copy_expert method in psycopg2 to copy data from a CSV file to a PostgreSQL table. I have a Postgres table like this:

create table my_table (
  cola          text,
  colb          text,
  colc          text,
  cold          text,
  cole          text,
  colf          text,
  colg          text
)

And a CSV containing data for the first five columns, like this:

cola,colb,colc,cold,cole
1,foo,a,10,vvv
2,bar,b,20,www
3,baz,c,30,xxx
4,boo,d,40,yyy
5,baa,e,50,zzz

I would like to copy the CSV data for the first five columns, while also specifying values for colf and colg (each row should have the same value for colf and colg).

I can copy the first five columns to my table like this:

import psycopg2

conn = psycopg2.connect('dbname=name user=username')
cur = conn.cursor()
copy_sql = """
  copy my_table (cola, colb, colc, cold, cole)
  from stdin with
    csv
    header
    delimiter as ','
"""
from_csv = '/path/to/data.csv'
with open(from_csv, 'r') as f:
  cur.copy_expert(sql=copy_sql, file=f)
conn.commit()
cur.close()

How can I also specify values for the last two columns using Python? I know that I can specify default values in the table DDL, like this:

create table my_table (
  cola          text,
  colb          text,
  colc          text,
  cold          text,
  cole          text,
  colf          text default 'foo',
  colg          text default 'bar'
)

But I'd like to add the values using Python, since each CSV upload will have its own values for colf and colg, and these values are determined by logic in my Python code.

Answer

Looks like there are a couple of ways to do this by first adding the columns I need to the data, and then uploading the updated data.

Using the petl package:

import psycopg2
from petl import fromcsv, addfield, todb

csv_file = '/path/to/data.csv'
table = fromcsv(csv_file)
table = addfield(table, 'colf', 'Some value')
table = addfield(table, 'colg', 'Another value')

conn = psycopg2.connect('dbname=test user=user')
todb(table, conn, 'my_table')

This works okay on small data, but it's incredibly slow on large data. The psycopg2 copy_from and copy_expert commands seem to go much faster, since they make use of PostgreSQL bulk copy. I was able to copy my data using copy_from by first converting my CSV file into a pandas dataframe:

import psycopg2
import pandas as pd
from io import StringIO

csv_file = '/path/to/file'
df = pd.read_csv(csv_file)
df['colf'] = 'My value'
df['colg'] = 'Foobar'

In order to make use of the psycopg2 copy_ commands, I need to convert the dataframe into a file-like object with read() and readline() methods, which I can do using StringIO (io.StringIO on Python 3; the original answer used the Python 2 StringIO module):

buf = StringIO()
df.to_csv(buf, header=False, index=False)
buf.seek(0)

Note that you need to seek back to the start of the buffer, because pandas.to_csv leaves the stream position at the end after writing. See this SO answer for an explanation.
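The position behavior is easy to verify in isolation. A minimal sketch with Python 3's io.StringIO (the sample CSV text is just an illustration, not the question's data):

```python
from io import StringIO

buf = StringIO()
buf.write("cola,colb\n1,foo\n")  # stand-in for the dataframe's CSV output

# After writing, the stream position sits at the end of the buffer,
# so reading here yields an empty string.
assert buf.read() == ''

# Rewind before handing the buffer to copy_from / copy_expert.
buf.seek(0)
assert buf.read() == "cola,colb\n1,foo\n"
```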

Then I can copy that buffer object:

conn = psycopg2.connect('dbname=test user=user')
cur = conn.cursor()
cur.copy_from(buf, 'my_table', sep=',')
conn.commit()
cur.close()
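If you'd rather avoid the pandas dependency for this constant-column case, the extra values can also be appended with the standard-library csv module before handing the buffer to copy_from. A sketch under the question's table and column names; the add_constant_columns helper and the sample string are my own illustration, not part of the original answer:

```python
import csv
from io import StringIO

def add_constant_columns(csv_text, extra_values):
    """Append the same extra values to every data row of a CSV string.

    Returns a StringIO rewound to the start, ready for copy_from.
    Assumes the input has a header row, which is dropped here because
    copy_from expects data rows only.
    """
    buf = StringIO()
    writer = csv.writer(buf, lineterminator='\n')
    reader = csv.reader(StringIO(csv_text))
    next(reader)  # skip the header row
    for row in reader:
        writer.writerow(row + list(extra_values))
    buf.seek(0)
    return buf

# Example with the sample data from the question:
sample = "cola,colb,colc,cold,cole\n1,foo,a,10,vvv\n2,bar,b,20,www\n"
buf = add_constant_columns(sample, ['My value', 'Foobar'])
# cur.copy_from(buf, 'my_table', sep=',')  # as in the snippet above
```

For large uploads you would stream the file through this row by row instead of building the string in memory, but the COPY call stays the same.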
