Performance: Python pandas DataFrame.to_csv append becomes gradually slower


Problem Description

Initial Question:

I'm looping through a couple of thousand pickle files with Python pandas DataFrames in them, which vary in the number of rows (between approx. 600 and 1300) but not in the number of columns (636 to be exact). Then I transform them (exactly the same transformations to each) and append them to a csv file using the DataFrame.to_csv() method.

The to_csv code excerpt:

if picklefile == '0000.p':
    dftemp.to_csv(finalnormCSVFile)
else:
    dftemp.to_csv(finalnormCSVFile, mode='a', header=False)

What bothers me is that it starts off pretty fast but performance decreases exponentially. I kept a processing time log:

start: 2015-03-24 03:26:36.958058

2015-03-24 03:26:36.958058
count = 0
time: 0:00:00

2015-03-24 03:30:53.254755
count = 100
time: 0:04:16.296697

2015-03-24 03:39:16.149883
count = 200
time: 0:08:22.895128

2015-03-24 03:51:12.247342
count = 300
time: 0:11:56.097459

2015-03-24 04:06:45.099034
count = 400
time: 0:15:32.851692

2015-03-24 04:26:09.411652
count = 500
time: 0:19:24.312618

2015-03-24 04:49:14.519529
count = 600
time: 0:23:05.107877

2015-03-24 05:16:30.175175
count = 700
time: 0:27:15.655646

2015-03-24 05:47:04.792289
count = 800
time: 0:30:34.617114

2015-03-24 06:21:35.137891
count = 900
time: 0:34:30.345602

2015-03-24 06:59:53.313468
count = 1000
time: 0:38:18.175577

2015-03-24 07:39:29.805270
count = 1100
time: 0:39:36.491802

2015-03-24 08:20:30.852613
count = 1200
time: 0:41:01.047343

2015-03-24 09:04:14.613948
count = 1300
time: 0:43:43.761335

2015-03-24 09:51:45.502538
count = 1400
time: 0:47:30.888590

2015-03-24 11:09:48.366950
count = 1500
time: 1:18:02.864412

2015-03-24 13:02:33.152289
count = 1600
time: 1:52:44.785339

2015-03-24 15:30:58.534493
count = 1700
time: 2:28:25.382204

2015-03-24 18:09:40.391639
count = 1800
time: 2:38:41.857146

2015-03-24 21:03:19.204587
count = 1900
time: 2:53:38.812948

2015-03-25 00:00:05.855970
count = 2000
time: 2:56:46.651383

2015-03-25 03:53:05.020944
count = 2100
time: 3:52:59.164974

2015-03-25 05:02:16.534149
count = 2200
time: 1:09:11.513205

2015-03-25 06:07:32.446801
count = 2300
time: 1:05:15.912652

2015-03-25 07:13:45.075216
count = 2400
time: 1:06:12.628415

2015-03-25 08:20:17.927286
count = 2500
time: 1:06:32.852070

2015-03-25 09:27:20.676520
count = 2600
time: 1:07:02.749234

2015-03-25 10:35:01.657199
count = 2700
time: 1:07:40.980679

2015-03-25 11:43:20.788178
count = 2800
time: 1:08:19.130979

2015-03-25 12:53:57.734390
count = 2900
time: 1:10:36.946212

2015-03-25 14:07:20.936314
count = 3000
time: 1:13:23.201924

2015-03-25 15:22:47.076786
count = 3100
time: 1:15:26.140472

2015-03-25 19:51:10.776342
count = 3200
time: 4:28:23.699556

2015-03-26 03:06:47.372698
count = 3300
time: 7:15:36.596356

count = 3324
end of cycle: 2015-03-26 03:59:54.161842

end: 2015-03-26 03:59:54.161842
total duration: 2 days, 0:33:17.203784

Update #1:

I did as you suggested, @Alexander, but it certainly has to do with the to_csv() method:

start: 2015-03-26 05:18:25.948410

2015-03-26 05:18:25.948410
count = 0
time: 0:00:00

2015-03-26 05:20:30.425041
count = 100
time: 0:02:04.476631

2015-03-26 05:22:27.680582
count = 200
time: 0:01:57.255541

2015-03-26 05:24:26.012598
count = 300
time: 0:01:58.332016

2015-03-26 05:26:16.542835
count = 400
time: 0:01:50.530237

2015-03-26 05:27:58.063196
count = 500
time: 0:01:41.520361

2015-03-26 05:29:45.769580
count = 600
time: 0:01:47.706384

2015-03-26 05:31:44.537213
count = 700
time: 0:01:58.767633

2015-03-26 05:33:41.591837
count = 800
time: 0:01:57.054624

2015-03-26 05:35:43.963843
count = 900
time: 0:02:02.372006

2015-03-26 05:37:46.171643
count = 1000
time: 0:02:02.207800

2015-03-26 05:38:36.493399
count = 1100
time: 0:00:50.321756

2015-03-26 05:39:42.123395
count = 1200
time: 0:01:05.629996

2015-03-26 05:41:13.122048
count = 1300
time: 0:01:30.998653

2015-03-26 05:42:41.885513
count = 1400
time: 0:01:28.763465

2015-03-26 05:44:20.937519
count = 1500
time: 0:01:39.052006

2015-03-26 05:46:16.012842
count = 1600
time: 0:01:55.075323

2015-03-26 05:48:14.727444
count = 1700
time: 0:01:58.714602

2015-03-26 05:50:15.792909
count = 1800
time: 0:02:01.065465

2015-03-26 05:51:48.228601
count = 1900
time: 0:01:32.435692

2015-03-26 05:52:22.755937
count = 2000
time: 0:00:34.527336

2015-03-26 05:52:58.289474
count = 2100
time: 0:00:35.533537

2015-03-26 05:53:39.406794
count = 2200
time: 0:00:41.117320

2015-03-26 05:54:11.348939
count = 2300
time: 0:00:31.942145

2015-03-26 05:54:43.057281
count = 2400
time: 0:00:31.708342

2015-03-26 05:55:19.483600
count = 2500
time: 0:00:36.426319

2015-03-26 05:55:52.216424
count = 2600
time: 0:00:32.732824

2015-03-26 05:56:27.409991
count = 2700
time: 0:00:35.193567

2015-03-26 05:57:00.810139
count = 2800
time: 0:00:33.400148

2015-03-26 05:58:17.109425
count = 2900
time: 0:01:16.299286

2015-03-26 05:59:31.021719
count = 3000
time: 0:01:13.912294

2015-03-26 06:00:49.200303
count = 3100
time: 0:01:18.178584

2015-03-26 06:02:07.732028
count = 3200
time: 0:01:18.531725

2015-03-26 06:03:28.518541
count = 3300
time: 0:01:20.786513

count = 3324
end of cycle: 2015-03-26 06:03:47.321182

end: 2015-03-26 06:03:47.321182
total duration: 0:45:21.372772

And as requested, the source code:

import pickle
import pandas as pd
import numpy as np
from os import listdir
from os.path import isfile, join
from datetime import datetime

# Defining function to deep copy pandas data frame:
def very_deep_copy(self):
    return pd.DataFrame(self.values.copy(), self.index.copy(), self.columns.copy())

# Adding function to Dataframe module:    
pd.DataFrame.very_deep_copy = very_deep_copy

#Define Data Frame Header:
head = [
    'ConcatIndex', 'Concatenated String Index', 'FileID', ..., 'Attribute<autosave>', 'Attribute<bgcolor>'
    ]
exclude = [
    'ConcatIndex', 'Concatenated String Index', 'FileID', ... , 'Real URL Array'
    ]

path = "./dataset_final/"
pickleFiles = [ f for f in listdir(path) if isfile(join(path,f)) ]
finalnormCSVFile = 'finalNormalizedDataFrame2.csv'

count = 0
start_time = datetime.now()
t1 = start_time
print("start: " + str(start_time) + "\n")


for picklefile in pickleFiles: 
    if count%100 == 0:
        t2 = datetime.now()
        print(str(t2))
        print('count = ' + str(count))
        print('time: ' + str(t2 - t1) + '\n')
        t1 = t2

    #DataFrame Manipulation:
    df = pd.read_pickle(path + picklefile)

    df['ConcatIndex'] = 100000*df.FileID + df.ID
    for i in range(0, len(df)):
        df.loc[i, 'Concatenated String Index'] = str(df['ConcatIndex'][i]).zfill(10)
    df.index = df.ConcatIndex


    #DataFrame Normalization:
    dftemp = df.very_deep_copy()
    for string in head:
        if string in exclude:
            if string != 'ConcatIndex':
                dftemp.drop(string, axis=1, inplace=True)
        else:
            if 'Real ' in string:
                max = pd.DataFrame.max(df[string.strip('Real ')])
            elif 'child' in string:
                max = pd.DataFrame.max(df[string.strip('child')+'desc'])
            else:
                max = pd.DataFrame.max(df[string])

            if max != 0:
                dftemp[string] = dftemp[string]/max

    dftemp.drop('ConcatIndex', axis=1, inplace=True)

    #Saving DataFrame in CSV:
    if picklefile == '0000.p':
        dftemp.to_csv(finalnormCSVFile)
    else:
        dftemp.to_csv(finalnormCSVFile, mode='a', header=False)

    count += 1

print('count = ' + str(count))
cycle_end_time = datetime.now()
print("end of cycle: " + str(cycle_end_time) + "\n")

end_time = datetime.now()
print("end: " + str(end_time))
print('total duration: ' + str(end_time - start_time) + '\n')

Update #2:

As suggested, I executed the command %prun %run "./DataSetNormalization.py" for the first couple of hundred pickle files and the result is as follows:

   136373640 function calls (136342619 primitive calls) in 1018.769 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      220  667.069    3.032  667.069    3.032 {method 'close' of '_io.TextIOWrapper' objects}
     1540   42.046    0.027   46.341    0.030 {pandas.lib.write_csv_rows}
      219   34.886    0.159   34.886    0.159 {built-in method collect}
     3520   16.782    0.005   16.782    0.005 {pandas.algos.take_2d_axis1_object_object}
    78323    9.948    0.000    9.948    0.000 {built-in method empty}
 25336892    9.645    0.000   12.635    0.000 {built-in method isinstance}
  1433941    9.344    0.000    9.363    0.000 generic.py:1845(__setattr__)
221051/220831    7.387    0.000  119.767    0.001 indexing.py:194(_setitem_with_indexer)
   723540    7.312    0.000    7.312    0.000 {method 'reduce' of 'numpy.ufunc' objects}
   273414    7.137    0.000   20.642    0.000 internals.py:2656(set)
   604245    6.846    0.000    6.850    0.000 {method 'copy' of 'numpy.ndarray' objects}
     1760    6.566    0.004    6.566    0.004 {pandas.lib.isnullobj}
   276274    5.315    0.000    5.315    0.000 {method 'ravel' of 'numpy.ndarray' objects}
  1719244    5.264    0.000    5.266    0.000 {built-in method array}
  1102450    5.070    0.000   29.543    0.000 internals.py:1804(make_block)
  1045687    5.056    0.000   10.209    0.000 index.py:709(__getitem__)
        1    4.718    4.718 1018.727 1018.727 DataSetNormalization.py:6(<module>)
   602485    4.575    0.000   15.087    0.000 internals.py:2586(iget)
   441662    4.562    0.000   33.386    0.000 internals.py:2129(apply)
   272754    4.550    0.000    4.550    0.000 internals.py:1291(set)
   220883    4.073    0.000    4.073    0.000 {built-in method charmap_encode}
  4781222    3.805    0.000    4.349    0.000 {built-in method getattr}
    52143    3.673    0.000    3.673    0.000 {built-in method truediv}
  1920486    3.671    0.000    3.672    0.000 {method 'get_loc' of 'pandas.index.IndexEngine' objects}
  1096730    3.513    0.000    8.370    0.000 internals.py:3035(__init__)
   875899    3.508    0.000   14.458    0.000 series.py:134(__init__)
   334357    3.420    0.000    3.439    0.000 {pandas.lib.infer_dtype}
  2581268    3.419    0.000    4.774    0.000 {pandas.lib.values_from_object}
  1102450    3.036    0.000    6.110    0.000 internals.py:59(__init__)
   824856    2.888    0.000   45.749    0.000 generic.py:1047(_get_item_cache)
  2424185    2.657    0.000    3.870    0.000 numeric.py:1910(isscalar)
   273414    2.505    0.000    9.332    0.000 frame.py:2113(_sanitize_column)
  1646198    2.491    0.000    2.880    0.000 index.py:698(__contains__)
   879639    2.461    0.000    2.461    0.000 generic.py:87(__init__)
   552988    2.385    0.000    4.451    0.000 internals.py:3565(_get_blkno_placements)
   824856    2.349    0.000   51.282    0.000 frame.py:1655(__getitem__)
   220831    2.224    0.000   21.670    0.000 internals.py:460(setitem)
   326437    2.183    0.000   11.352    0.000 common.py:1862(_possibly_infer_to_datetimelike)
   602485    2.167    0.000   16.974    0.000 frame.py:1982(_box_item_values)
   602485    2.087    0.000   23.202    0.000 internals.py:2558(get)
   770739    2.036    0.000    6.471    0.000 internals.py:1238(__init__)
   276494    1.966    0.000    1.966    0.000 {pandas.lib.get_blkno_indexers}
10903876/10873076    1.935    0.000    1.972    0.000 {built-in method len}
   220831    1.924    0.000   76.647    0.000 indexing.py:372(setter)
      220    1.893    0.009    1.995    0.009 {built-in method load}
  1920486    1.855    0.000    8.198    0.000 index.py:1173(get_loc)
   112860    1.828    0.000    9.607    0.000 common.py:202(_isnull_ndarraylike)
   602485    1.707    0.000    8.903    0.000 series.py:238(from_array)
   875899    1.688    0.000    2.493    0.000 series.py:263(_set_axis)
     3300    1.661    0.001    1.661    0.001 {method 'tolist' of 'numpy.ndarray' objects}
  1102670    1.609    0.000    2.024    0.000 internals.py:108(mgr_locs)
  4211850    1.593    0.000    1.593    0.000 {built-in method issubclass}
  1335546    1.501    0.000    2.253    0.000 generic.py:297(_get_axis_name)
   273414    1.411    0.000   37.866    0.000 frame.py:1994(__setitem__)
   441662    1.356    0.000    7.884    0.000 indexing.py:982(_convert_to_indexer)
   220831    1.349    0.000  131.331    0.001 indexing.py:95(__setitem__)
   273414    1.329    0.000   23.170    0.000 generic.py:1138(_set_item)
   326437    1.276    0.000    6.203    0.000 fromnumeric.py:2259(prod)
   274734    1.271    0.000    2.113    0.000 shape_base.py:60(atleast_2d)
   273414    1.242    0.000   34.396    0.000 frame.py:2072(_set_item)
   602485    1.183    0.000    1.979    0.000 generic.py:1061(_set_as_cached)
   934422    1.175    0.000    1.894    0.000 {method 'view' of 'numpy.ndarray'objects}
     1540    1.144    0.001   58.217    0.038 format.py:1409(_save_chunk)
   220831    1.144    0.000    9.198    0.000 indexing.py:139(_convert_tuple)
   441662    1.137    0.000    3.036    0.000 indexing.py:154(_convert_scalar_indexer)
   220831    1.087    0.000    1.281    0.000 arrayprint.py:343(array2string)
  1332026    1.056    0.000    3.997    0.000 generic.py:310(_get_axis)
   602485    1.046    0.000    9.949    0.000 frame.py:1989(_box_col_values)
      220    1.029    0.005    1.644    0.007 internals.py:2429(_interleave)
   824856    1.025    0.000   46.777    0.000 frame.py:1680(_getitem_column)
  1491578    1.022    0.000    2.990    0.000 common.py:58(_check)
   782616    1.010    0.000    3.513    0.000 numeric.py:394(asarray)
   290354    0.988    0.000    1.386    0.000 internals.py:1950(shape)
   220831    0.958    0.000   15.392    0.000 generic.py:2101(copy)
   273414    0.940    0.000    1.796    0.000 indexing.py:1520(_convert_to_index_sliceable)
   220831    0.920    0.000    1.558    0.000 common.py:1110(_possibly_downcast_to_dtype)
   220611    0.914    0.000    0.914    0.000 {pandas.lib.is_bool_array}
   498646    0.906    0.000    0.906    0.000 {method 'clear' of 'dict' objects}
   715345    0.848    0.000   13.083    0.000 common.py:132(_isnull_new)
   452882    0.824    0.000    1.653    0.000 index.py:256(__array_finalize__)
   602485    0.801    0.000    0.801    0.000 internals.py:208(iget)
    52583    0.748    0.000    2.038    0.000 common.py:1223(_fill_zeros)
   606005    0.736    0.000    6.755    0.000 internals.py:95(make_block_same_class)
   708971    0.732    0.000    2.156    0.000 internals.py:3165(values)
  1760378    0.724    0.000    0.724    0.000 internals.py:2025(_get_items)
   109560    0.720    0.000    6.140    0.000 nanops.py:152(_get_values)
   220831    0.718    0.000   11.017    0.000 internals.py:2395(copy)
   924669    0.712    0.000    1.298    0.000 common.py:2248(_get_dtype_type)
  1515796    0.698    0.000    0.868    0.000 {built-in method hasattr}
   220831    0.670    0.000    4.299    0.000 internals.py:435(copy)
   875899    0.661    0.000    0.661    0.000 series.py:285(_set_subtyp)
   220831    0.648    0.000    0.649    0.000 {method 'get_value' of 'pandas.index.IndexEngine' objects}
   452882    0.640    0.000    0.640    0.000 index.py:218(_reset_identity)
   715345    0.634    0.000    1.886    0.000 {pandas.lib.isscalar}
     1980    0.626    0.000    1.172    0.001 internals.py:3497(_merge_blocks)
   220831    0.620    0.000    2.635    0.000 common.py:1933(_is_bool_indexer)
   272754    0.608    0.000    0.899    0.000 internals.py:1338(should_store)
   220831    0.599    0.000    3.463    0.000 series.py:482(__getitem__)
   498645    0.591    0.000    1.497    0.000 generic.py:1122(_clear_item_cache)
  1119390    0.584    0.000    1.171    0.000 index.py:3936(_ensure_index)
   220831    0.573    0.000    1.883    0.000 index.py:222(view)
   814797    0.555    0.000    0.905    0.000 internals.py:3086(_values)
    52583    0.543    0.000   15.545    0.000 ops.py:469(wrapper)
   220831    0.536    0.000    3.760    0.000 internals.py:371(_try_cast_result)
   228971    0.533    0.000    0.622    0.000 generic.py:1829(__getattr__)
   769651    0.528    0.000    0.528    0.000 {built-in method min}
   224351    0.509    0.000    2.030    0.000 generic.py:1099(_maybe_update_cacher)
   ...

I will rerun it for confirmation, but it looks like it certainly has something to do with pandas' to_csv() method, because most of the run time is spent on IO and the csv writer. Why is it having this effect? Any suggestions?

Update #3:

Well, I did a full %prun test and indeed almost 90% of the time is spent in {method 'close' of '_io.TextIOWrapper' objects}. So I guess here's the problem... What do you guys think?

My questions here are:

  1. What is causing the decrease in performance here?
  2. Does pandas.DataFrame.to_csv() in append mode load the whole file each time it writes to it? (A quick way to test this is sketched below.)
  3. Is there a way to speed up the process?
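
One way to probe question 2 empirically is to time appends against a file that keeps growing: if to_csv(mode='a') re-read the target file, each append would get slower as the file grows. A minimal standalone sketch (the file name append_test.csv and the random data are illustrative, not part of the original script):

import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 50))
df.to_csv('append_test.csv')                       # create the file once

for i in range(1, 51):
    t0 = time.perf_counter()
    df.to_csv('append_test.csv', mode='a', header=False)
    elapsed = time.perf_counter() - t0
    if i % 10 == 0:
        # if append re-read the file, these times would keep climbing
        print('append #{}: {:.4f} s'.format(i, elapsed))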

Solution

In this kind of situation you should profile your code (to see which function calls are taking the most time); that way you can check empirically whether it really is slow in to_csv rather than elsewhere...

From looking at your code: firstly, there's a lot of copying here and a lot of looping (not enough vectorization)... every time you see looping, look for a way to remove it. Secondly, when you use things like zfill, I wonder if you want to_fwf (fixed width format) rather than to_csv?

Some sanity testing: are some files significantly bigger than others (which could lead to you hitting swap)? Are you sure the largest files are only 1200 rows? Have you checked this, e.g. using wc -l?
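
To make that sanity check concrete, here is a small sketch that prints the on-disk size, row count and column count of every pickle in the dataset directory (it assumes the same path as in the script above and that each pickle holds a single DataFrame):

import pandas as pd
from os import listdir
from os.path import getsize, isfile, join

path = "./dataset_final/"   # same directory as in the main script

for name in sorted(f for f in listdir(path) if isfile(join(path, f))):
    full = join(path, name)
    df = pd.read_pickle(full)
    # unusually large files (in rows or bytes) stand out immediately
    print('{:<15} {:>12} bytes {:>7} rows {:>5} cols'.format(
        name, getsize(full), len(df), df.shape[1]))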

IMO I think it is unlikely to be garbage collection (as was suggested in the other answer).


Here are a few improvements on your code, which should improve the runtime.

Since the columns are fixed, I would extract the column calculations and vectorize the real, child and other normalizations. Use apply rather than iterating (for the zfill).

columns_to_drop = set(head) & set(exclude)  # maybe also - ['ConcatIndex']
remaining_cols = set(head) - set(exclude)
real_cols = [r for r in remaining_cols if 'Real ' in r]
real_cols_suffix = [r.strip('Real ') for r in real_cols]
remaining_cols = remaining_cols - set(real_cols)
child_cols = [r for r in remaining_cols if 'child' in r]
child_cols_desc = [r.strip('child') + 'desc' for r in child_cols]
remaining_cols = remaining_cols - set(child_cols)

for count, picklefile in enumerate(pickleFiles):
    if count % 100 == 0:
        t2 = datetime.now()
        print(str(t2))
        print('count = ' + str(count))
        print('time: ' + str(t2 - t1) + '\n')
        t1 = t2

    #DataFrame Manipulation:
    df = pd.read_pickle(path + picklefile)

    df['ConcatIndex'] = 100000*df.FileID + df.ID
    # use apply here rather than iterating
    df['Concatenated String Index'] = df['ConcatIndex'].apply(lambda x: str(x).zfill(10))
    df.index = df.ConcatIndex

    #DataFrame Normalization:
    dftemp = df.very_deep_copy()  # don't *think* you need this

    # drop all excludes
    dftemp.drop(list(columns_to_drop), axis=1, inplace=True)

    # normalize real cols
    m = dftemp[real_cols_suffix].max()
    m.index = real_cols
    dftemp[real_cols] = dftemp[real_cols] / m

    # normalize child cols
    m = dftemp[child_cols_desc].max()
    m.index = child_cols
    dftemp[child_cols] = dftemp[child_cols] / m

    # normalize remaining
    remaining = list(remaining_cols)
    dftemp[remaining] = dftemp[remaining] / dftemp[remaining].max()

    # if this case is important then discard the rows of m with .max() is 0
    #if max != 0:
    #    dftemp[string] = dftemp[string]/max

    # this is dropped earlier, if you need it, then subtract ['ConcatIndex'] from columns_to_drop
    # dftemp.drop('ConcatIndex', axis=1, inplace=True)

    #Saving DataFrame in CSV:
    if picklefile == '0000.p':
        dftemp.to_csv(finalnormCSVFile)
    else:
        dftemp.to_csv(finalnormCSVFile, mode='a', header=False)

As a point of style I would probably choose to wrap each of these parts into functions; this will also mean more things can be gc'd if that really was the issue...
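
A rough sketch of that structure: normalize() is a placeholder for the drop/scale steps shown above, and path, pickleFiles and finalnormCSVFile mirror the names in the original script:

import pandas as pd
from os.path import join

path = "./dataset_final/"
finalnormCSVFile = 'finalNormalizedDataFrame2.csv'
pickleFiles = []        # assumed: same directory listing as in the original script


def normalize(df):
    # placeholder for the drop / max-scaling steps shown above
    return df


def process_one(picklefile, out_handle, write_header):
    df = pd.read_pickle(join(path, picklefile))
    dftemp = normalize(df)
    dftemp.to_csv(out_handle, header=write_header)
    # df and dftemp are no longer referenced once this returns, so they can be collected


with open(finalnormCSVFile, 'w') as out:
    for count, picklefile in enumerate(pickleFiles):
        process_one(picklefile, out, write_header=(count == 0))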


Another option, which would be faster, is to use PyTables (HDFStore) if you didn't need the resulting output to be csv (but I expect you do)...
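
For reference, a minimal sketch of that alternative (it requires the PyTables package to be installed; the file name normalized.h5 and the key 'normalized' are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 5), columns=list('abcde'))

with pd.HDFStore('normalized.h5', mode='w') as store:
    store.append('normalized', df)       # append() writes in table format, which is appendable
    store.append('normalized', df)       # subsequent frames are appended to the same key

print(pd.read_hdf('normalized.h5', 'normalized').shape)   # (2000, 5)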

The best thing to do by far is to profile your code, e.g. with %prun in IPython; see http://pynash.org/2013/03/06/timing-and-profiling.html. Then you can see whether it definitely is to_csv that's slow, and specifically where (which line of your code and which lines of pandas code).
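
Outside IPython, the standard-library profiler gives the same kind of listing as %prun; a sketch, using the script name from Update #2 and an illustrative output file name:

import cProfile
import pstats

# run the whole script under the profiler and dump the raw stats to a file
cProfile.run('exec(open("DataSetNormalization.py").read())', 'normalization.prof')

# print the top 25 entries sorted by internal time ("Ordered by: internal time" above)
pstats.Stats('normalization.prof').sort_stats('tottime').print_stats(25)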


Aha, I'd missed that you are appending all of these to a single csv file. And your prun output shows that most of the time is spent in close, so let's keep the file open:

# outside of the for loop (so the file is opened and closed only once)
f = open(finalnormCSVFile, 'w')

...
for picklefile in ...

    if picklefile == '0000.p':
        dftemp.to_csv(f)
    else:
        dftemp.to_csv(f, mode='a', header=False)
...

f.close()

Each time the file is opened in order to append, it needs to seek to the end before writing; it could be that this is the expensive part (I don't see why this should be that bad, but keeping the file open removes the need to do it).
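
To see how much the repeated open/close actually costs on a given machine, here is a small self-contained benchmark (illustrative file names and data) comparing the two patterns:

import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(500, 100))

# pattern 1: reopen the csv for every append (what the original loop does)
t0 = time.perf_counter()
df.to_csv('reopen.csv')
for _ in range(200):
    df.to_csv('reopen.csv', mode='a', header=False)
print('reopen for every append: {:.3f} s'.format(time.perf_counter() - t0))

# pattern 2: keep a single handle open and pass it to to_csv
t0 = time.perf_counter()
with open('keep_open.csv', 'w') as f:
    df.to_csv(f)
    for _ in range(200):
        df.to_csv(f, header=False)
print('single open handle:      {:.3f} s'.format(time.perf_counter() - t0))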
