Reading large tables into Pandas, is there an intermediate step?

Problem Description

I have a data-analysis script that I am putting together. This script connects to Teradata, runs a Select * from the table, and loads the result into a pandas dataframe.

import teradata
import pandas as pd

# set up the Teradata session manager (appName/version here are placeholders)
udaExec = teradata.UdaExec(appName="DataAnalysis", version="1.0", logConsole=False)

with udaExec.connect(method="xxx", dsn="xxx", username="xxx", password="xxx") as session:

    query = "Select * from TableA"

    # read in records
    df = pd.read_sql(query, session)

    # misc pandas tests below...

This works great for tables with 100k records or less, but the problem is that many tables have far more records than that (millions and millions of records), and it just tends to run indefinitely.

Is there some intermediate step I can take? I've been researching and I see something about copying the DB table to a .csv file or .txt file or something first, and then loading the pandas dataframe from that (instead of loading from the table itself), but I can't make sense of it.

Any advice would be appreciated! Thanks.

Recommended Answer

In a comment I promised to provide some code that can read a table from a server quickly into a local CSV file, then read that CSV file into a Pandas dataframe. Note that this code is written for postgresql, but you could probably adapt it pretty easily for other databases.

Here is the code:

from cStringIO import StringIO  # Python 2; on Python 3 use io.StringIO instead
import psycopg2
import psycopg2.sql as sql
import pandas as pd

database = 'my_db'
pg_host = 'my_postgres_server'
table = 'my_table'
# note: you should also create a ~/.pgpass file with the credentials needed to access
# this server, e.g., a line like "*:*:*:username:password" (if you only access one server)

con = psycopg2.connect(database=database, host=pg_host)
cur = con.cursor()    

# Copy data from the database to a dataframe, using psycopg2 .copy_expert() function.
csv = StringIO()  # or tempfile.SpooledTemporaryFile()
# The next line is the right way to insert a table name into a query, but it requires 
# psycopg2 >= 2.7. See here for more details: https://stackoverflow.com/q/13793399/3830997
copy_query = sql.SQL("COPY {} TO STDOUT WITH CSV HEADER").format(sql.Identifier(table))
cur.copy_expert(copy_query, csv)
csv.seek(0)  # move back to start of csv data
df = pd.read_csv(csv)

Here also is some code that writes large dataframes to the database via the CSV route:

csv = StringIO()
df.to_csv(csv, index=False, header=False)
csv.seek(0)
try:
    cur.copy_from(csv, table, sep=',', null='\\N', size=8192, columns=list(df.columns))
    con.commit()
except:
    con.rollback()
    raise

I tested this code over my 10 Mbps office network (don't ask!) with a 70,000 row table (5.3 MB as a CSV).

When reading a table from the database, I found that the code above was about 1/3 faster than pandas.read_sql() (5.5s vs. 8s). I'm not sure that would justify the extra complexity in most cases. This is probably about as fast as you can get -- postgresql's COPY TO ... command is very fast, and so is Pandas' read_csv.

When writing a dataframe to the database, I found that using a CSV file (the code above) was about 50x faster than using pandas' df.to_sql() (5.8s vs 288s). This is mainly because Pandas doesn't use multi-row inserts. This seems to have been a subject of active discussion for several years -- see https://github.com/pandas-dev/pandas/issues/8953 .
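As a side note, newer pandas releases (0.24 and later) added a method argument to to_sql() that batches rows into multi-row INSERT statements, which addresses part of the issue linked above. A minimal sketch, assuming a SQLAlchemy engine and placeholder connection details:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string -- substitute your own host, database and credentials.
engine = create_engine("postgresql+psycopg2://username:password@my_postgres_server/my_db")

# method="multi" (pandas 0.24+) packs many rows into each INSERT statement instead of
# issuing one statement per row; chunksize caps how many rows go into each statement.
# This usually narrows, but does not close, the gap with the COPY-based approach above.
df.to_sql("my_table", engine, if_exists="append", index=False,
          method="multi", chunksize=1000)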

A couple of notes about chunksize: this may not do what most users expect. If you set chunksize in pandas.read_sql(), the query still runs as one command, but the results are returned to your program in batches; this is done with an iterator that yields each chunk in turn. If you use chunksize in pandas.to_sql(), it causes the inserts to be done in batches, reducing memory requirements. However, at least on my system, each batch is still broken down into individual insert statements for each row, and those take a long time to run.
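To make the chunksize behavior concrete, here is a minimal, self-contained sketch; it uses an in-memory SQLite database purely as a stand-in for the real server, and the table and column names are made up for illustration:

import pandas as pd
import sqlalchemy

# In-memory SQLite database as a stand-in for the real server in the examples above.
engine = sqlalchemy.create_engine("sqlite://")
pd.DataFrame({"id": range(100000), "value": range(100000)}).to_sql(
    "TableA", engine, index=False)

# chunksize in read_sql: the query still runs once, but the results come back as an
# iterator of DataFrames, each holding `chunksize` rows.
pieces = []
for chunk in pd.read_sql("SELECT * FROM TableA", engine, chunksize=20000):
    pieces.append(chunk)          # or process/aggregate each chunk as it arrives
df = pd.concat(pieces, ignore_index=True)

# chunksize in to_sql: inserts are issued in batches, which bounds memory use, but
# each batch may still be sent as per-row INSERT statements, which is why it stays slow.
df.to_sql("TableA_copy", engine, if_exists="replace", index=False, chunksize=20000)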

Also note: the odo package looks like it would be great for moving data quickly between a dataframe and any database. I couldn't get it to run successfully, but you may have better luck. More info here: http://odo.pydata.org/en/latest/overview.html
