What is the fastest way to upload a big csv file in notebook to work with python pandas?


Question


    I'm trying to upload a csv file, which is 250MB. Basically 4 million rows and 6 columns of time series data (1min). The usual procedure is:

    import pandas as pd

    location = r'C:\Users\Name\Folder_1\Folder_2\file.csv'
    df = pd.read_csv(location)
    

    This procedure takes about 20 minutes! Very preliminarily, I have explored a few alternative options.

    I wonder if anybody has compared these options (or others) and whether there is a clear winner. If nobody answers, I will post my results in the future; I just don't have time right now.
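
    A minimal sketch, assuming the path above and the column names a to f described in the answer below, of timing the baseline load and of declaring the datetime columns up front (illustrative only, no speed-up measured here):

    import time

    import pandas as pd

    location = r'C:\Users\Name\Folder_1\Folder_2\file.csv'

    def timed_read(**kwargs):
        """Read the CSV once and print the elapsed wall-clock time."""
        start = time.perf_counter()
        frame = pd.read_csv(location, **kwargs)
        print(f'read_csv({kwargs}): {time.perf_counter() - start:.1f} s')
        return frame

    # Baseline: plain read_csv, as in the snippet above.
    df = timed_read()

    # Variant: declare the datetime columns up front instead of converting them later
    # (column names a..f are taken from the answer below).
    df = timed_read(parse_dates=['a', 'b', 'c', 'd', 'e', 'f'])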

    Solution

    Here are the results of my read and write comparison for the DF (shape: 4000000 x 6, size in memory 183.1 MB, size of the uncompressed CSV: 492 MB).

    Comparison of the following storage formats: CSV, CSV.gzip, Pickle, HDF5 (with various compression settings):

                      read_s  write_s  size_ratio_to_CSV
    storage
    CSV               17.900    69.00              1.000
    CSV.gzip          18.900   186.00              0.047
    Pickle             0.173     1.77              0.374
    HDF_fixed          0.196     2.03              0.435
    HDF_tab            0.230     2.60              0.437
    HDF_tab_zlib_c5    0.845     5.44              0.035
    HDF_tab_zlib_c9    0.860     5.95              0.035
    HDF_tab_bzip2_c5   2.500    36.50              0.011
    HDF_tab_bzip2_c9   2.500    36.50              0.011
    

    Column legend: read_s is the reading time in seconds, write_s the writing/saving time in seconds, and size_ratio_to_CSV the file size ratio in relation to the uncompressed CSV file.
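
    The numbers above were collected with %timeit in IPython (raw output below). As a rough, hedged sketch, a similar table could be assembled in plain Python along the following lines; the file names and the small stand-in frame are illustrative, single runs are timed instead of best-of-3, and the HDF5 line needs the PyTables package:

    import os
    import time

    import numpy as np
    import pandas as pd

    def bench(label, write, read, path, csv_size):
        """Time one write and one read of `path`; relate its size to the plain CSV."""
        t0 = time.perf_counter()
        write(path)
        write_s = time.perf_counter() - t0

        t0 = time.perf_counter()
        read(path)
        read_s = time.perf_counter() - t0

        return {'storage': label,
                'read_s': round(read_s, 3),
                'write_s': round(write_s, 3),
                'size_ratio_to_CSV': round(os.path.getsize(path) / csv_size, 3)}

    # Any DataFrame works for the harness itself; a small random one keeps the run short.
    df = pd.DataFrame(np.random.randn(100_000, 6), columns=list('abcdef'))
    df.to_csv('data.csv')
    csv_size = os.path.getsize('data.csv')

    results = [
        bench('CSV',       df.to_csv,    pd.read_csv,    'data.csv',    csv_size),
        bench('Pickle',    df.to_pickle, pd.read_pickle, 'data.pickle', csv_size),
        # HDF5 (and its compression options) works the same way via to_hdf/read_hdf.
        bench('HDF_fixed', lambda p: df.to_hdf(p, 'df'),
                           lambda p: pd.read_hdf(p, 'df'), 'data.h5', csv_size),
    ]
    print(pd.DataFrame(results).set_index('storage'))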

    RAW DATA:

    CSV:

    In [68]: %timeit df.to_csv(fcsv)
    1 loop, best of 3: 1min 9s per loop
    
    In [74]: %timeit pd.read_csv(fcsv)
    1 loop, best of 3: 17.9 s per loop
    

    CSV.gzip:

    In [70]: %timeit df.to_csv(fcsv_gz, compression='gzip')
    1 loop, best of 3: 3min 6s per loop
    
    In [75]: %timeit pd.read_csv(fcsv_gz)
    1 loop, best of 3: 18.9 s per loop
    

    Pickle:

    In [66]: %timeit df.to_pickle(fpckl)
    1 loop, best of 3: 1.77 s per loop
    
    In [72]: %timeit pd.read_pickle(fpckl)
    10 loops, best of 3: 173 ms per loop
    

    HDF (format='fixed') [Default]:

    In [67]: %timeit df.to_hdf(fh5, 'df')
    1 loop, best of 3: 2.03 s per loop
    
    In [73]: %timeit pd.read_hdf(fh5, 'df')
    10 loops, best of 3: 196 ms per loop
    

    HDF (format='table'):

    In [37]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab.h5', 'df', format='t')
    1 loop, best of 3: 2.6 s per loop
    
    In [38]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab.h5', 'df')
    1 loop, best of 3: 230 ms per loop
    

    HDF (format='table', complib='zlib', complevel=5):

    In [40]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib5.h5', 'df', format='t', complevel=5, complib='zlib')
    1 loop, best of 3: 5.44 s per loop
    
    In [41]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib5.h5', 'df')
    1 loop, best of 3: 854 ms per loop
    

    HDF (format='table', complib='zlib', complevel=9):

    In [36]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib9.h5', 'df', format='t', complevel=9, complib='zlib')
    1 loop, best of 3: 5.95 s per loop
    
    In [39]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib9.h5', 'df')
    1 loop, best of 3: 860 ms per loop
    

    HDF (format='table', complib='bzip2', complevel=5):

    In [42]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l5.h5', 'df', format='t', complevel=5, complib='bzip2')
    1 loop, best of 3: 36.5 s per loop
    
    In [43]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l5.h5', 'df')
    1 loop, best of 3: 2.5 s per loop
    

    HDF (format='table', complib='bzip2', complevel=9):

    In [42]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l9.h5', 'df', format='t', complevel=9, complib='bzip2')
    1 loop, best of 3: 36.5 s per loop
    
    In [43]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l9.h5', 'df')
    1 loop, best of 3: 2.5 s per loop
    

    PS: I can't test feather on my Windows notebook.

    DF info:

    In [49]: df.shape
    Out[49]: (4000000, 6)
    
    In [50]: df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 4000000 entries, 0 to 3999999
    Data columns (total 6 columns):
    a    datetime64[ns]
    b    datetime64[ns]
    c    datetime64[ns]
    d    datetime64[ns]
    e    datetime64[ns]
    f    datetime64[ns]
    dtypes: datetime64[ns](6)
    memory usage: 183.1 MB
    
    In [41]: df.head()
    Out[41]:
                        a                   b                   c  \
    0 1970-01-01 00:00:00 1970-01-01 00:01:00 1970-01-01 00:02:00
    1 1970-01-01 00:01:00 1970-01-01 00:02:00 1970-01-01 00:03:00
    2 1970-01-01 00:02:00 1970-01-01 00:03:00 1970-01-01 00:04:00
    3 1970-01-01 00:03:00 1970-01-01 00:04:00 1970-01-01 00:05:00
    4 1970-01-01 00:04:00 1970-01-01 00:05:00 1970-01-01 00:06:00
    
                        d                   e                   f
    0 1970-01-01 00:03:00 1970-01-01 00:04:00 1970-01-01 00:05:00
    1 1970-01-01 00:04:00 1970-01-01 00:05:00 1970-01-01 00:06:00
    2 1970-01-01 00:05:00 1970-01-01 00:06:00 1970-01-01 00:07:00
    3 1970-01-01 00:06:00 1970-01-01 00:07:00 1970-01-01 00:08:00
    4 1970-01-01 00:07:00 1970-01-01 00:08:00 1970-01-01 00:09:00
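
    The frame is six datetime64[ns] columns at 1-minute frequency, each shifted by one minute from the previous one (a starts at 00:00, f at 00:05). A comparable test frame can be generated as follows (a reproduction sketch, not taken from the original benchmark):

    import pandas as pd

    # 4,000,000 timestamps at 1-minute frequency, starting at the epoch.
    rng = pd.date_range('1970-01-01', periods=4_000_000, freq='min')

    # Columns a..f are the same series shifted by 0..5 minutes, matching
    # df.info() and df.head() above (about 183 MB in memory).
    df = pd.DataFrame({col: rng + pd.Timedelta(minutes=i)
                       for i, col in enumerate('abcdef')})

    print(df.shape)   # (4000000, 6)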
    

    File sizes:

    { .data }  » ls -lh 37010212.*                                                                          /d/temp/.data
    -rw-r--r-- 1 Max None 492M May  3 22:21 37010212.csv
    -rw-r--r-- 1 Max None  23M May  3 22:19 37010212.csv.gz
    -rw-r--r-- 1 Max None 214M May  3 22:02 37010212.h5
    -rw-r--r-- 1 Max None 184M May  3 22:02 37010212.pickle
    -rw-r--r-- 1 Max None 215M May  4 10:39 37010212_tab.h5
    -rw-r--r-- 1 Max None 5.4M May  4 10:46 37010212_tab_compress_bzip2_l5.h5
    -rw-r--r-- 1 Max None 5.4M May  4 10:51 37010212_tab_compress_bzip2_l9.h5
    -rw-r--r-- 1 Max None  17M May  4 10:42 37010212_tab_compress_zlib5.h5
    -rw-r--r-- 1 Max None  17M May  4 10:36 37010212_tab_compress_zlib9.h5
    

    Conclusion:

    Pickle and HDF5 are much faster, but HDF5 is more convenient: you can store multiple tables/frames inside a single file, you can read your data conditionally (see the where parameter of read_hdf(), sketched below), and you can also store your data compressed (zlib is faster, bzip2 provides a better compression ratio), etc.
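
    A hedged sketch of such a conditional read (the file name and the small stand-in frame are illustrative; format='table' together with data_columns is what makes the where query possible, and PyTables must be installed):

    import pandas as pd

    # Small stand-in frame; the real one above has 4,000,000 rows.
    df = pd.DataFrame({'a': pd.date_range('1970-01-01', periods=10_000, freq='min'),
                       'b': range(10_000)})

    # Write in table format and make the columns queryable.
    df.to_hdf('data_tab.h5', 'df', format='table', data_columns=True)

    # Read back only the rows matching the condition instead of loading the whole file.
    subset = pd.read_hdf('data_tab.h5', 'df',
                         where="a >= pd.Timestamp('1970-01-03') & b < 6000")
    print(len(subset))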

    PS: if you can build/use feather-format, it should be even faster than HDF5 and Pickle.
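
    For reference, a minimal feather sketch, assuming a pandas version that ships to_feather/read_feather and the pyarrow (or the older feather-format) package; the file name is illustrative:

    import pandas as pd

    # feather expects a plain default RangeIndex, which this frame has.
    df = pd.DataFrame({'a': pd.date_range('1970-01-01', periods=10_000, freq='min'),
                       'b': range(10_000)})

    df.to_feather('data.feather')
    df2 = pd.read_feather('data.feather')
    print(df.equals(df2))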

    PPS: don't use Pickle for big data frames, as you may end up with a "SystemError: error return without exception set" error message. It's also described here and here.
