What is the fastest way to upload a big csv file in notebook to work with python pandas?


Problem description


    I'm trying to upload a 250 MB csv file: basically 4 million rows and 6 columns of time-series data at 1-minute frequency. The usual procedure is:

    location = r'C:\Users\Name\Folder_1\Folder_2\file.csv'
    df = pd.read_csv(location)
    

    This procedure takes about 20 minutes!!! As a very preliminary step, I have explored a few storage options.

    I wonder if anybody has compared these options (or more) and whether there is a clear winner. If nobody answers, I will post my results in the future. I just don't have time right now.

    Solution

    Here are the results of my read and write comparison for the DF (shape: 4000000 x 6, size in memory: 183.1 MB, size of the uncompressed CSV: 492 MB).

    Comparison of the following storage formats: CSV, CSV.gzip, Pickle, and HDF5 (with various compression settings):

                      read_s  write_s  size_ratio_to_CSV
    storage
    CSV               17.900    69.00              1.000
    CSV.gzip          18.900   186.00              0.047
    Pickle             0.173     1.77              0.374
    HDF_fixed          0.196     2.03              0.435
    HDF_tab            0.230     2.60              0.437
    HDF_tab_zlib_c5    0.845     5.44              0.035
    HDF_tab_zlib_c9    0.860     5.95              0.035
    HDF_tab_bzip2_c5   2.500    36.50              0.011
    HDF_tab_bzip2_c9   2.500    36.50              0.011
    

    read_s - reading time in seconds

    write_s - writing/saving time in seconds

    size_ratio_to_CSV - file size ratio in relation to the uncompressed CSV file
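
    For reference, a table like the one above can be assembled with a small timing harness along the lines of the sketch below. The file names, the time_it helper, and the reduced row count are my own assumptions for illustration; the actual numbers were collected with %timeit on the full 4M-row frame.

    import os
    import time

    import pandas as pd

    # Small stand-in frame (the real benchmark used 4,000,000 rows).
    base = pd.date_range('1970-01-01', periods=100_000, freq='min')
    df = pd.DataFrame({c: base + pd.Timedelta(minutes=i)
                       for i, c in enumerate('abcdef')})

    def time_it(func):
        """Wall-clock seconds for a single call (hypothetical helper)."""
        start = time.perf_counter()
        func()
        return time.perf_counter() - start

    results = {}
    results['CSV'] = (time_it(lambda: df.to_csv('bench.csv')),
                      time_it(lambda: pd.read_csv('bench.csv')),
                      os.path.getsize('bench.csv'))
    results['Pickle'] = (time_it(lambda: df.to_pickle('bench.pickle')),
                         time_it(lambda: pd.read_pickle('bench.pickle')),
                         os.path.getsize('bench.pickle'))
    # HDF5 requires the PyTables package.
    results['HDF_tab_zlib_c5'] = (
        time_it(lambda: df.to_hdf('bench.h5', key='df', format='table',
                                  complib='zlib', complevel=5)),
        time_it(lambda: pd.read_hdf('bench.h5', 'df')),
        os.path.getsize('bench.h5'))

    report = pd.DataFrame(results, index=['write_s', 'read_s', 'size']).T
    report['size_ratio_to_CSV'] = report['size'] / results['CSV'][2]
    print(report[['read_s', 'write_s', 'size_ratio_to_CSV']])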

    RAW DATA:

    CSV:

    In [68]: %timeit df.to_csv(fcsv)
    1 loop, best of 3: 1min 9s per loop
    
    In [74]: %timeit pd.read_csv(fcsv)
    1 loop, best of 3: 17.9 s per loop
    

    CSV.gzip:

    In [70]: %timeit df.to_csv(fcsv_gz, compression='gzip')
    1 loop, best of 3: 3min 6s per loop
    
    In [75]: %timeit pd.read_csv(fcsv_gz)
    1 loop, best of 3: 18.9 s per loop
    

    Pickle:

    In [66]: %timeit df.to_pickle(fpckl)
    1 loop, best of 3: 1.77 s per loop
    
    In [72]: %timeit pd.read_pickle(fpckl)
    10 loops, best of 3: 173 ms per loop
    

    HDF (format='fixed') [Default]:

    In [67]: %timeit df.to_hdf(fh5, 'df')
    1 loop, best of 3: 2.03 s per loop
    
    In [73]: %timeit pd.read_hdf(fh5, 'df')
    10 loops, best of 3: 196 ms per loop
    

    HDF (format='table'):

    In [37]: %timeit df.to_hdf(r'D:\temp\.data\37010212_tab.h5', 'df', format='t')
    1 loop, best of 3: 2.6 s per loop

    In [38]: %timeit pd.read_hdf(r'D:\temp\.data\37010212_tab.h5', 'df')
    1 loop, best of 3: 230 ms per loop
    

    HDF (format='table', complib='zlib', complevel=5):

    In [40]: %timeit df.to_hdf(r'D:\temp\.data\37010212_tab_compress_zlib5.h5', 'df', format='t', complevel=5, complib='zlib')
    1 loop, best of 3: 5.44 s per loop

    In [41]: %timeit pd.read_hdf(r'D:\temp\.data\37010212_tab_compress_zlib5.h5', 'df')
    1 loop, best of 3: 854 ms per loop
    

    HDF (format='table', complib='zlib', complevel=9):

    In [36]: %timeit df.to_hdf(r'D:\temp\.data\37010212_tab_compress_zlib9.h5', 'df', format='t', complevel=9, complib='zlib')
    1 loop, best of 3: 5.95 s per loop

    In [39]: %timeit pd.read_hdf(r'D:\temp\.data\37010212_tab_compress_zlib9.h5', 'df')
    1 loop, best of 3: 860 ms per loop
    

    HDF (format='table', complib='bzip2', complevel=5):

    In [42]: %timeit df.to_hdf(r'D:\temp\.data\37010212_tab_compress_bzip2_l5.h5', 'df', format='t', complevel=5, complib='bzip2')
    1 loop, best of 3: 36.5 s per loop

    In [43]: %timeit pd.read_hdf(r'D:\temp\.data\37010212_tab_compress_bzip2_l5.h5', 'df')
    1 loop, best of 3: 2.5 s per loop
    

    HDF (format='table', complib='bzip2', complevel=9):

    In [42]: %timeit df.to_hdf(r'D:\temp\.data\37010212_tab_compress_bzip2_l9.h5', 'df', format='t', complevel=9, complib='bzip2')
    1 loop, best of 3: 36.5 s per loop

    In [43]: %timeit pd.read_hdf(r'D:\temp\.data\37010212_tab_compress_bzip2_l9.h5', 'df')
    1 loop, best of 3: 2.5 s per loop
    

    PS: I can't test feather on my Windows notebook.

    DF info:

    In [49]: df.shape
    Out[49]: (4000000, 6)
    
    In [50]: df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 4000000 entries, 0 to 3999999
    Data columns (total 6 columns):
    a    datetime64[ns]
    b    datetime64[ns]
    c    datetime64[ns]
    d    datetime64[ns]
    e    datetime64[ns]
    f    datetime64[ns]
    dtypes: datetime64[ns](6)
    memory usage: 183.1 MB
    
    In [41]: df.head()
    Out[41]:
                        a                   b                   c  
    0 1970-01-01 00:00:00 1970-01-01 00:01:00 1970-01-01 00:02:00
    1 1970-01-01 00:01:00 1970-01-01 00:02:00 1970-01-01 00:03:00
    2 1970-01-01 00:02:00 1970-01-01 00:03:00 1970-01-01 00:04:00
    3 1970-01-01 00:03:00 1970-01-01 00:04:00 1970-01-01 00:05:00
    4 1970-01-01 00:04:00 1970-01-01 00:05:00 1970-01-01 00:06:00
    
                        d                   e                   f
    0 1970-01-01 00:03:00 1970-01-01 00:04:00 1970-01-01 00:05:00
    1 1970-01-01 00:04:00 1970-01-01 00:05:00 1970-01-01 00:06:00
    2 1970-01-01 00:05:00 1970-01-01 00:06:00 1970-01-01 00:07:00
    3 1970-01-01 00:06:00 1970-01-01 00:07:00 1970-01-01 00:08:00
    4 1970-01-01 00:07:00 1970-01-01 00:08:00 1970-01-01 00:09:00
    
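
    For anyone who wants to reproduce the benchmark, a frame with the same shape, dtypes and 1-minute spacing can be generated in a couple of lines. This is a sketch; the per-column minute offsets are inferred from df.head() above.

    import pandas as pd

    # Six datetime64[ns] columns, 4,000,000 rows at 1-minute frequency;
    # column i is shifted i minutes relative to column 'a', as in df.head().
    base = pd.date_range('1970-01-01', periods=4_000_000, freq='min')
    df = pd.DataFrame({col: base + pd.Timedelta(minutes=i)
                       for i, col in enumerate('abcdef')})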

    File sizes:

    { .data }  » ls -lh 37010212.*                                                                          /d/temp/.data
    -rw-r--r-- 1 Max None 492M May  3 22:21 37010212.csv
    -rw-r--r-- 1 Max None  23M May  3 22:19 37010212.csv.gz
    -rw-r--r-- 1 Max None 214M May  3 22:02 37010212.h5
    -rw-r--r-- 1 Max None 184M May  3 22:02 37010212.pickle
    -rw-r--r-- 1 Max None 215M May  4 10:39 37010212_tab.h5
    -rw-r--r-- 1 Max None 5.4M May  4 10:46 37010212_tab_compress_bzip2_l5.h5
    -rw-r--r-- 1 Max None 5.4M May  4 10:51 37010212_tab_compress_bzip2_l9.h5
    -rw-r--r-- 1 Max None  17M May  4 10:42 37010212_tab_compress_zlib5.h5
    -rw-r--r-- 1 Max None  17M May  4 10:36 37010212_tab_compress_zlib9.h5
    

    Conclusion:

    Pickle and HDF5 are much faster, but HDF5 is more convenient - you can store multiple tables/frames inside, you can read your data conditionally (look at the where parameter in read_hdf()), and you can also store your data compressed (zlib is faster, bzip2 provides a better compression ratio), etc.
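
    For example, the conditional read works on format='table' stores; a minimal sketch (the file name and the data_columns choice are my assumptions):

    import pandas as pd

    # small stand-in frame with a queryable datetime column
    base = pd.date_range('1970-01-01', periods=1_000, freq='min')
    df = pd.DataFrame({c: base + pd.Timedelta(minutes=i)
                       for i, c in enumerate('abcdef')})

    # 'where' filtering needs format='table', and the queried column must be
    # declared in data_columns (requires the PyTables package)
    df.to_hdf('data.h5', key='df', format='table', data_columns=['a'])

    subset = pd.read_hdf('data.h5', 'df', where="a >= '1970-01-01 00:10:00'")
    print(subset.shape)  # only the matching rows are loaded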

    PS: if you can build/use the feather format, it should be even faster than HDF5 and Pickle.
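
    With a pandas build that has feather support (it wraps pyarrow), the round trip is just two calls; a minimal sketch, not benchmarked here:

    import pandas as pd

    base = pd.date_range('1970-01-01', periods=1_000, freq='min')
    df = pd.DataFrame({c: base + pd.Timedelta(minutes=i)
                       for i, c in enumerate('abcdef')})

    # to_feather requires a default RangeIndex, which this frame has
    df.to_feather('data.feather')
    df2 = pd.read_feather('data.feather')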

    PPS: don't use Pickle for big data frames, as you may end up with a SystemError: error return without exception set error message. It's also described here and here.

