What is the fastest way to upload a big csv file in notebook to work with python pandas?
Question
I'm trying to load a 250 MB csv file: basically 4 million rows and 6 columns of time-series data (1-minute frequency). The usual procedure is:
location = r'C:\Users\Name\Folder_1\Folder_2\file.csv'
df = pd.read_csv(location)
This procedure takes about 20 minutes! As a very preliminary step, I have explored the following options.
I wonder if anybody has compared these options (or more) and whether there's a clear winner. If nobody answers, I will post my results in the future; I just don't have time right now.
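Before switching storage formats entirely, `read_csv` itself can often be sped up by telling pandas the column types up front, so it skips type inference over millions of rows. A minimal sketch (the tiny in-memory CSV and column names here are illustrative, mirroring the six-datetime-column frame described in the answer, not the real 250 MB file):

```python
import io
import pandas as pd

# Small in-memory CSV standing in for the real file (hypothetical data).
csv_data = io.StringIO(
    "a,b\n"
    "1970-01-01 00:00:00,1970-01-01 00:01:00\n"
    "1970-01-01 00:01:00,1970-01-01 00:02:00\n"
)

# Declaring which columns are datetimes avoids a slow
# object -> datetime inference pass on every row.
df = pd.read_csv(csv_data, parse_dates=["a", "b"])
print(df.dtypes)
```

For pure-numeric columns, passing an explicit `dtype=` mapping has a similar effect.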
Solution
Here are the results of my read and write comparison for the DF (shape: 4000000 x 6, size in memory: 183.1 MB, size of uncompressed CSV: 492 MB).
Comparison of the following storage formats (CSV, CSV.gzip, Pickle, HDF5 with various compression settings), measuring reading, writing/saving, and file size ratio relative to the uncompressed CSV file:

storage            read_s  write_s  size_ratio_to_CSV
CSV                17.900    69.00              1.000
CSV.gzip           18.900   186.00              0.047
Pickle              0.173     1.77              0.374
HDF_fixed           0.196     2.03              0.435
HDF_tab             0.230     2.60              0.437
HDF_tab_zlib_c5     0.845     5.44              0.035
HDF_tab_zlib_c9     0.860     5.95              0.035
HDF_tab_bzip2_c5    2.500    36.50              0.011
HDF_tab_bzip2_c9    2.500    36.50              0.011
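The read/write columns in a table like this can be reproduced with a small timing harness. A minimal sketch using `time.perf_counter` on a tiny synthetic frame (the frame size and file names are illustrative, not the 4,000,000-row benchmark data; single runs, unlike `%timeit`'s best-of-3):

```python
import tempfile
import time
from pathlib import Path
import pandas as pd

# Tiny synthetic frame; the original benchmark used 4,000,000 rows.
df = pd.DataFrame({"a": pd.date_range("1970-01-01", periods=1000, freq="min")})

def bench(write, read):
    """Return (write_seconds, read_seconds) for one round trip."""
    t0 = time.perf_counter()
    write()
    t1 = time.perf_counter()
    out = read()
    t2 = time.perf_counter()
    assert len(out) == len(df)  # sanity check: round trip preserved all rows
    return t1 - t0, t2 - t1

with tempfile.TemporaryDirectory() as d:
    fcsv = str(Path(d) / "bench.csv")
    fpckl = str(Path(d) / "bench.pickle")
    csv_times = bench(lambda: df.to_csv(fcsv, index=False),
                      lambda: pd.read_csv(fcsv, parse_dates=["a"]))
    pckl_times = bench(lambda: df.to_pickle(fpckl),
                       lambda: pd.read_pickle(fpckl))

print("CSV    (write_s, read_s):", csv_times)
print("Pickle (write_s, read_s):", pckl_times)
```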
RAW DATA:
CSV:
In [68]: %timeit df.to_csv(fcsv)
1 loop, best of 3: 1min 9s per loop

In [74]: %timeit pd.read_csv(fcsv)
1 loop, best of 3: 17.9 s per loop
CSV.gzip:
In [70]: %timeit df.to_csv(fcsv_gz, compression='gzip')
1 loop, best of 3: 3min 6s per loop

In [75]: %timeit pd.read_csv(fcsv_gz)
1 loop, best of 3: 18.9 s per loop
Pickle:
In [66]: %timeit df.to_pickle(fpckl)
1 loop, best of 3: 1.77 s per loop

In [72]: %timeit pd.read_pickle(fpckl)
10 loops, best of 3: 173 ms per loop
HDF (format='fixed') [default]:

In [67]: %timeit df.to_hdf(fh5, 'df')
1 loop, best of 3: 2.03 s per loop

In [73]: %timeit pd.read_hdf(fh5, 'df')
10 loops, best of 3: 196 ms per loop
HDF (format='table'):

In [37]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab.h5', 'df', format='t')
1 loop, best of 3: 2.6 s per loop

In [38]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab.h5', 'df')
1 loop, best of 3: 230 ms per loop
HDF (format='table', complib='zlib', complevel=5):

In [40]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib5.h5', 'df', format='t', complevel=5, complib='zlib')
1 loop, best of 3: 5.44 s per loop

In [41]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib5.h5', 'df')
1 loop, best of 3: 854 ms per loop
HDF (format='table', complib='zlib', complevel=9):

In [36]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib9.h5', 'df', format='t', complevel=9, complib='zlib')
1 loop, best of 3: 5.95 s per loop

In [39]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib9.h5', 'df')
1 loop, best of 3: 860 ms per loop
HDF (format='table', complib='bzip2', complevel=5):

In [42]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l5.h5', 'df', format='t', complevel=5, complib='bzip2')
1 loop, best of 3: 36.5 s per loop

In [43]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l5.h5', 'df')
1 loop, best of 3: 2.5 s per loop
HDF (format='table', complib='bzip2', complevel=9):

In [42]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l9.h5', 'df', format='t', complevel=9, complib='bzip2')
1 loop, best of 3: 36.5 s per loop

In [43]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l9.h5', 'df')
1 loop, best of 3: 2.5 s per loop
PS: I can't test feather on my Windows notebook.

DF info:
In [49]: df.shape
Out[49]: (4000000, 6)

In [50]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000000 entries, 0 to 3999999
Data columns (total 6 columns):
a    datetime64[ns]
b    datetime64[ns]
c    datetime64[ns]
d    datetime64[ns]
e    datetime64[ns]
f    datetime64[ns]
dtypes: datetime64[ns](6)
memory usage: 183.1 MB

In [41]: df.head()
Out[41]:
                    a                   b                   c
0 1970-01-01 00:00:00 1970-01-01 00:01:00 1970-01-01 00:02:00
1 1970-01-01 00:01:00 1970-01-01 00:02:00 1970-01-01 00:03:00
2 1970-01-01 00:02:00 1970-01-01 00:03:00 1970-01-01 00:04:00
3 1970-01-01 00:03:00 1970-01-01 00:04:00 1970-01-01 00:05:00
4 1970-01-01 00:04:00 1970-01-01 00:05:00 1970-01-01 00:06:00

                    d                   e                   f
0 1970-01-01 00:03:00 1970-01-01 00:04:00 1970-01-01 00:05:00
1 1970-01-01 00:04:00 1970-01-01 00:05:00 1970-01-01 00:06:00
2 1970-01-01 00:05:00 1970-01-01 00:06:00 1970-01-01 00:07:00
3 1970-01-01 00:06:00 1970-01-01 00:07:00 1970-01-01 00:08:00
4 1970-01-01 00:07:00 1970-01-01 00:08:00 1970-01-01 00:09:00
File sizes:
{ .data } » ls -lh 37010212.*    /d/temp/.data
-rw-r--r-- 1 Max None 492M May 3 22:21 37010212.csv
-rw-r--r-- 1 Max None  23M May 3 22:19 37010212.csv.gz
-rw-r--r-- 1 Max None 214M May 3 22:02 37010212.h5
-rw-r--r-- 1 Max None 184M May 3 22:02 37010212.pickle
-rw-r--r-- 1 Max None 215M May 4 10:39 37010212_tab.h5
-rw-r--r-- 1 Max None 5.4M May 4 10:46 37010212_tab_compress_bzip2_l5.h5
-rw-r--r-- 1 Max None 5.4M May 4 10:51 37010212_tab_compress_bzip2_l9.h5
-rw-r--r-- 1 Max None  17M May 4 10:42 37010212_tab_compress_zlib5.h5
-rw-r--r-- 1 Max None  17M May 4 10:36 37010212_tab_compress_zlib9.h5
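The size_ratio_to_CSV column of the summary table follows directly from this listing; a quick check in plain Python (sizes in MB, taken from the `ls -lh` output above):

```python
# File sizes (MB) from the ls -lh listing above.
sizes_mb = {
    "CSV": 492, "CSV.gzip": 23, "Pickle": 184, "HDF_fixed": 214,
    "HDF_tab": 215, "HDF_tab_zlib": 17, "HDF_tab_bzip2": 5.4,
}

# Ratio of each file's size to the uncompressed CSV baseline.
ratios = {name: round(mb / sizes_mb["CSV"], 3) for name, mb in sizes_mb.items()}
for name, r in ratios.items():
    print(f"{name:15s} {r:.3f}")
```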
Conclusion:
Pickle and HDF5 are much faster, but HDF5 is more convenient: you can store multiple tables/frames inside one file, you can read your data conditionally (see the where parameter in read_hdf()), and you can also store your data compressed (zlib is faster, bzip2 provides a better compression ratio), etc.

PS: if you can build/use feather-format, it should be even faster compared to HDF5 and Pickle.

PPS: don't use Pickle for big data frames, as you may end up with a "SystemError: error return without exception set" error message. It's also described here and here.
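The conditional-read convenience mentioned in the conclusion can be sketched as follows. This is a minimal illustration, not the benchmark code: it assumes the PyTables package is installed (required for HDF5 support in pandas), and the column names, filter, and file name are made up:

```python
import tempfile
from pathlib import Path
import pandas as pd

# Small made-up frame with a queryable numeric column.
df = pd.DataFrame({
    "ts": pd.date_range("1970-01-01", periods=100, freq="min"),
    "value": range(100),
})

with tempfile.TemporaryDirectory() as d:
    path = str(Path(d) / "demo.h5")
    # format='table' plus data_columns makes columns queryable via `where`;
    # the default format='fixed' does not support conditional reads.
    df.to_hdf(path, key="df", format="table", data_columns=True)
    subset = pd.read_hdf(path, "df", where="value >= 90")

print(len(subset))  # -> 10, only the rows matching the condition
```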