Pandas to_csv() slow saving large dataframe

Problem description

I'm guessing this is an easy fix, but I'm running into an issue where it takes nearly an hour to save a pandas dataframe to a CSV file using the to_csv() function. I'm using Anaconda Python 2.7.12 with pandas 0.19.1.

import os
import glob
import pandas as pd

src_files = glob.glob(os.path.join('/my/path', "*.csv.gz"))

# 1 - Takes 2 min to read 20m records from 30 files
stage = pd.DataFrame()
for file_ in sorted(src_files):
    iter_csv = pd.read_csv(file_
                     , sep=','
                     , index_col=False
                     , header=0
                     , low_memory=False
                     , iterator=True
                     , chunksize=100000
                     , compression='gzip'
                     , memory_map=True
                     , encoding='utf-8')

    df = pd.concat([chunk for chunk in iter_csv])
    stage = stage.append(df, ignore_index=True)

# 2 - Takes 55 min to write 20m records from one dataframe
stage.to_csv('output.csv'
             , sep='|'
             , header=True
             , index=False
             , chunksize=100000
             , encoding='utf-8')

del stage

I've confirmed the hardware and memory are working, but these are fairly wide tables (~ 100 columns) of mostly numeric (decimal) data.

Thanks

Recommended answer

You are reading compressed files but writing a plain-text file, so this could be an I/O bottleneck.

Writing a compressed file could speed up the write by as much as 10x:

stage.to_csv('output.csv.gz'
             , sep='|'
             , header=True
             , index=False
             , chunksize=100000
             , compression='gzip'
             , encoding='utf-8')

Additionally, you could experiment with different chunk sizes and compression methods ('bz2', 'xz'), as in the sketch below.
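
A minimal timing sketch for comparing the options, assuming stage is the dataframe built above, the output file names are placeholders, and your pandas build supports the 'bz2' and 'xz' codecs (compression=None is the uncompressed baseline):

import time

for compression, suffix in [(None, 'csv'),
                            ('gzip', 'csv.gz'),
                            ('bz2', 'csv.bz2'),
                            ('xz', 'csv.xz')]:
    start = time.time()
    # Write the same dataframe with each compression method so the
    # elapsed times can be compared directly.
    stage.to_csv('output.' + suffix
                 , sep='|'
                 , header=True
                 , index=False
                 , chunksize=100000
                 , compression=compression
                 , encoding='utf-8')
    print('%s: %.1f s' % (compression, time.time() - start))

The same loop can be repeated with different chunksize values to see whether a larger or smaller chunk helps on your hardware.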
