Turn pandas dataframe into a file-like object in memory?
Question
I am loading about 2 - 2.5 million records into a Postgres database every day.
I then read this data with pd.read_sql to turn it into a dataframe and then I do some column manipulation and some minor merging. I am saving this modified data as a separate table for other people to use.
When I do pd.to_sql it takes forever. If I save a csv file and use COPY FROM in Postgres, the whole thing only takes a few minutes, but the server is on a separate machine and it is a pain to transfer files there.
Using psycopg2, it looks like I can use copy_expert to benefit from the bulk copying, but still use python. I want to, if possible, avoid writing an actual csv file. Can I do this in memory with a pandas dataframe?
Here is an example of my pandas code. I would like to add the copy_expert or something to make saving this data much faster if possible.
for date in required_date_range:
    df = pd.read_sql(sql=query, con=pg_engine, params={'x': date})
    # ... do stuff to the columns ...
    df.to_sql('table_name', pg_engine, index=False, if_exists='append', dtype=final_table_dtypes)
Can someone help me with example code? I would prefer to use pandas still and it would be nice to do it in memory. If not, I will just write a csv temporary file and do it that way.
Edit - here is my final code which works. It only takes a couple of hundred seconds per date (millions of rows) instead of a couple of hours.
to_sql = """COPY %s FROM STDIN WITH CSV HEADER"""
def process_file(conn, table_name, file_object):
    fake_conn = cms_dtypes.pg_engine.raw_connection()
    fake_cur = fake_conn.cursor()
    fake_cur.copy_expert(sql=to_sql % table_name, file=file_object)
    fake_conn.commit()
    fake_cur.close()

# after doing stuff to the dataframe
s_buf = io.StringIO()
df.to_csv(s_buf, index=False)  # index=False so the CSV columns match the target table
s_buf.seek(0)                  # rewind the buffer so COPY reads from the start
process_file(cms_dtypes.pg_engine, 'fact_cms_employee', s_buf)
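The buffer round-trip can be sanity-checked without touching Postgres; a minimal sketch (the frame and column names here are made up, not from the original code):

```python
import io

import pandas as pd

# toy frame standing in for the real query result (names are hypothetical)
df = pd.DataFrame({"emp_id": [1, 2], "name": ["alice", "bob"]})

s_buf = io.StringIO()
df.to_csv(s_buf, index=False)  # index=False keeps the CSV columns aligned with the table
s_buf.seek(0)                  # rewind; a reader consumes from the current position

# what COPY would see on STDIN
round_trip = pd.read_csv(s_buf)
print(round_trip.equals(df))  # → True
```

If the `seek(0)` is omitted, the reader sees an empty stream and COPY loads nothing, which is the easiest bug to hit with this pattern.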
Answer
The Python module io (docs) has the necessary tools for file-like objects.
import io
# text buffer
s_buf = io.StringIO()
# saving a data frame to a buffer (same as with a regular file):
df.to_csv(s_buf)
Edit. (I forgot) In order to read from the buffer afterwards, its position should be set to the beginning:
s_buf.seek(0)
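The effect of the stream position is easy to see with a bare StringIO:

```python
import io

buf = io.StringIO()
buf.write("a,b\n1,2\n")
print(repr(buf.read()))  # → '' — after writing, the position sits at the end
buf.seek(0)              # move back to the start
print(repr(buf.read()))  # → 'a,b\n1,2\n' — now the whole contents are readable
```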
I'm not familiar with psycopg2, but according to the docs both copy_expert and copy_from can be used, for example:
cur.copy_from(s_buf, table)
(For Python 2, see StringIO.)