What is the fastest way to upload a big csv file in notebook to work with python pandas?
Question
I'm trying to load a 250 MB csv file: basically 4 million rows and 6 columns of time-series data (1-minute frequency). The usual procedure is:
location = r'C:\Users\Name\Folder_1\Folder_2\file.csv'
df = pd.read_csv(location)
This procedure takes about 20 minutes! As a very preliminary step, I have explored the following options.
I wonder if anybody has compared these options (or more) and whether there's a clear winner. If nobody answers, I will post my results in the future; I just don't have time right now.
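Before switching storage formats entirely, `read_csv` itself can often be sped up by telling pandas the column types up front, so it skips type inference over millions of rows. A minimal sketch (the tiny in-memory CSV and column names here are illustrative, mirroring the six-datetime-column frame described in the answer, not the real 250 MB file):

```python
import io
import pandas as pd

# Small in-memory CSV standing in for the real file (hypothetical data).
csv_data = io.StringIO(
    "a,b\n"
    "1970-01-01 00:00:00,1970-01-01 00:01:00\n"
    "1970-01-01 00:01:00,1970-01-01 00:02:00\n"
)

# Declaring which columns are datetimes avoids a slow
# object -> datetime inference pass on every row.
df = pd.read_csv(csv_data, parse_dates=["a", "b"])
print(df.dtypes)
```

For pure-numeric columns, passing an explicit `dtype=` mapping has a similar effect.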
Solution
Here are the results of my read and write comparison for the DF (shape: 4000000 x 6, size in memory: 183.1 MB, size of uncompressed CSV: 492 MB).
Comparison of the following storage formats (CSV, CSV.gzip, Pickle, HDF5 with various compression settings), measuring reading, writing/saving, and file size ratio relative to the uncompressed CSV file:

storage            read_s  write_s  size_ratio_to_CSV
CSV                17.900    69.00              1.000
CSV.gzip           18.900   186.00              0.047
Pickle              0.173     1.77              0.374
HDF_fixed           0.196     2.03              0.435
HDF_tab             0.230     2.60              0.437
HDF_tab_zlib_c5     0.845     5.44              0.035
HDF_tab_zlib_c9     0.860     5.95              0.035
HDF_tab_bzip2_c5    2.500    36.50              0.011
HDF_tab_bzip2_c9    2.500    36.50              0.011
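The read/write columns in a table like this can be reproduced with a small timing harness. A minimal sketch using `time.perf_counter` on a tiny synthetic frame (the frame size and file names are illustrative, not the 4,000,000-row benchmark data; single runs, unlike `%timeit`'s best-of-3):

```python
import tempfile
import time
from pathlib import Path
import pandas as pd

# Tiny synthetic frame; the original benchmark used 4,000,000 rows.
df = pd.DataFrame({"a": pd.date_range("1970-01-01", periods=1000, freq="min")})

def bench(write, read):
    """Return (write_seconds, read_seconds) for one round trip."""
    t0 = time.perf_counter()
    write()
    t1 = time.perf_counter()
    out = read()
    t2 = time.perf_counter()
    assert len(out) == len(df)  # sanity check: round trip preserved all rows
    return t1 - t0, t2 - t1

with tempfile.TemporaryDirectory() as d:
    fcsv = str(Path(d) / "bench.csv")
    fpckl = str(Path(d) / "bench.pickle")
    csv_times = bench(lambda: df.to_csv(fcsv, index=False),
                      lambda: pd.read_csv(fcsv, parse_dates=["a"]))
    pckl_times = bench(lambda: df.to_pickle(fpckl),
                       lambda: pd.read_pickle(fpckl))

print("CSV    (write_s, read_s):", csv_times)
print("Pickle (write_s, read_s):", pckl_times)
```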
RAW DATA:
CSV:
In [68]: %timeit df.to_csv(fcsv)
1 loop, best of 3: 1min 9s per loop

In [74]: %timeit pd.read_csv(fcsv)
1 loop, best of 3: 17.9 s per loop
CSV.gzip:
In [70]: %timeit df.to_csv(fcsv_gz, compression='gzip')
1 loop, best of 3: 3min 6s per loop

In [75]: %timeit pd.read_csv(fcsv_gz)
1 loop, best of 3: 18.9 s per loop
Pickle:
In [66]: %timeit df.to_pickle(fpckl)
1 loop, best of 3: 1.77 s per loop

In [72]: %timeit pd.read_pickle(fpckl)
10 loops, best of 3: 173 ms per loop
HDF (format='fixed') [default]:

In [67]: %timeit df.to_hdf(fh5, 'df')
1 loop, best of 3: 2.03 s per loop

In [73]: %timeit pd.read_hdf(fh5, 'df')
10 loops, best of 3: 196 ms per loop
HDF (format='table'):

In [37]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab.h5', 'df', format='t')
1 loop, best of 3: 2.6 s per loop

In [38]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab.h5', 'df')
1 loop, best of 3: 230 ms per loop
HDF (format='table', complib='zlib', complevel=5):

In [40]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib5.h5', 'df', format='t', complevel=5, complib='zlib')
1 loop, best of 3: 5.44 s per loop

In [41]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib5.h5', 'df')
1 loop, best of 3: 854 ms per loop
HDF (format='table', complib='zlib', complevel=9):

In [36]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib9.h5', 'df', format='t', complevel=9, complib='zlib')
1 loop, best of 3: 5.95 s per loop

In [39]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib9.h5', 'df')
1 loop, best of 3: 860 ms per loop
HDF (format='table', complib='bzip2', complevel=5):

In [42]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l5.h5', 'df', format='t', complevel=5, complib='bzip2')
1 loop, best of 3: 36.5 s per loop

In [43]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l5.h5', 'df')
1 loop, best of 3: 2.5 s per loop
HDF (format='table', complib='bzip2', complevel=9):

In [42]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l9.h5', 'df', format='t', complevel=9, complib='bzip2')
1 loop, best of 3: 36.5 s per loop

In [43]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l9.h5', 'df')
1 loop, best of 3: 2.5 s per loop
PS: I can't test feather on my Windows notebook.

DF info:
In [49]: df.shape
Out[49]: (4000000, 6)

In [50]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000000 entries, 0 to 3999999
Data columns (total 6 columns):
a    datetime64[ns]
b    datetime64[ns]
c    datetime64[ns]
d    datetime64[ns]
e    datetime64[ns]
f    datetime64[ns]
dtypes: datetime64[ns](6)
memory usage: 183.1 MB

In [41]: df.head()
Out[41]:
                    a                   b                   c
0 1970-01-01 00:00:00 1970-01-01 00:01:00 1970-01-01 00:02:00
1 1970-01-01 00:01:00 1970-01-01 00:02:00 1970-01-01 00:03:00
2 1970-01-01 00:02:00 1970-01-01 00:03:00 1970-01-01 00:04:00
3 1970-01-01 00:03:00 1970-01-01 00:04:00 1970-01-01 00:05:00
4 1970-01-01 00:04:00 1970-01-01 00:05:00 1970-01-01 00:06:00

                    d                   e                   f
0 1970-01-01 00:03:00 1970-01-01 00:04:00 1970-01-01 00:05:00
1 1970-01-01 00:04:00 1970-01-01 00:05:00 1970-01-01 00:06:00
2 1970-01-01 00:05:00 1970-01-01 00:06:00 1970-01-01 00:07:00
3 1970-01-01 00:06:00 1970-01-01 00:07:00 1970-01-01 00:08:00
4 1970-01-01 00:07:00 1970-01-01 00:08:00 1970-01-01 00:09:00
File sizes:
{ .data } » ls -lh 37010212.*    /d/temp/.data
-rw-r--r-- 1 Max None 492M May 3 22:21 37010212.csv
-rw-r--r-- 1 Max None  23M May 3 22:19 37010212.csv.gz
-rw-r--r-- 1 Max None 214M May 3 22:02 37010212.h5
-rw-r--r-- 1 Max None 184M May 3 22:02 37010212.pickle
-rw-r--r-- 1 Max None 215M May 4 10:39 37010212_tab.h5
-rw-r--r-- 1 Max None 5.4M May 4 10:46 37010212_tab_compress_bzip2_l5.h5
-rw-r--r-- 1 Max None 5.4M May 4 10:51 37010212_tab_compress_bzip2_l9.h5
-rw-r--r-- 1 Max None  17M May 4 10:42 37010212_tab_compress_zlib5.h5
-rw-r--r-- 1 Max None  17M May 4 10:36 37010212_tab_compress_zlib9.h5
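The size_ratio_to_CSV column of the summary table follows directly from this listing; a quick check in plain Python (sizes in MB, taken from the `ls -lh` output above):

```python
# File sizes (MB) from the ls -lh listing above.
sizes_mb = {
    "CSV": 492, "CSV.gzip": 23, "Pickle": 184, "HDF_fixed": 214,
    "HDF_tab": 215, "HDF_tab_zlib": 17, "HDF_tab_bzip2": 5.4,
}

# Ratio of each file's size to the uncompressed CSV baseline.
ratios = {name: round(mb / sizes_mb["CSV"], 3) for name, mb in sizes_mb.items()}
for name, r in ratios.items():
    print(f"{name:15s} {r:.3f}")
```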
Conclusion:
Pickle and HDF5 are much faster, but HDF5 is more convenient: you can store multiple tables/frames inside one file, you can read your data conditionally (see the where parameter in read_hdf()), and you can also store your data compressed (zlib is faster, bzip2 provides a better compression ratio), etc.

PS: if you can build/use feather-format, it should be even faster compared to HDF5 and Pickle.

PPS: don't use Pickle for big data frames, as you may end up with a "SystemError: error return without exception set" error message. It's also described here and here.
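The conditional-read convenience mentioned in the conclusion can be sketched as follows. This is a minimal illustration, not the benchmark code: it assumes the PyTables package is installed (required for HDF5 support in pandas), and the column names, filter, and file name are made up:

```python
import tempfile
from pathlib import Path
import pandas as pd

# Small made-up frame with a queryable numeric column.
df = pd.DataFrame({
    "ts": pd.date_range("1970-01-01", periods=100, freq="min"),
    "value": range(100),
})

with tempfile.TemporaryDirectory() as d:
    path = str(Path(d) / "demo.h5")
    # format='table' plus data_columns makes columns queryable via `where`;
    # the default format='fixed' does not support conditional reads.
    df.to_hdf(path, key="df", format="table", data_columns=True)
    subset = pd.read_hdf(path, "df", where="value >= 90")

print(len(subset))  # -> 10, only the rows matching the condition
```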