Out of memory error when reading a csv file in chunks


Problem description

I am processing a 2.5 GB csv file. The table looks like this:

    columns=[ka,kb_1,kb_2,timeofEvent,timeInterval]
    0:'3M' '2345' '2345' '2014-10-5',3000
    1:'3M' '2958' '2152' '2015-3-22',5000
    2:'GE' '2183' '2183' '2012-12-31',515
    3:'3M' '2958' '2958' '2015-3-10',395
    4:'GE' '2183' '2285' '2015-4-19',1925
    5:'GE' '2598' '2598' '2015-3-17',1915

I want to group by ka and kb_1 to get a result like this:

    columns=[ka,kb,errorNum,errorRate,totalNum of records]
    '3M','2345',0,0%,1
    '3M','2958',1,50%,2
    'GE','2183',1,50%,2
    'GE','2598',0,0%,1

(Definition of an error record: when kb_1 != kb_2, the corresponding record is treated as abnormal.)



My computer (Ubuntu 12.04) has 16 GB of memory, and free -m returns:

                 total       used       free     shared    buffers     cached
    Mem:        112809      14476      98333          0        128      10823
    -/+ buffers/cache:       3524     109285
    Swap:            0          0          0

My Python file is called bigData.py:

    import pandas as pd
    import numpy as np

    import sys,traceback,os
    cksize = 98333  # or 1024; neither chunk size worked at all
    try:
        dfs = pd.DataFrame()
        reader = pd.read_table('data/petaJoined.csv', chunksize=cksize)

        for chunk in reader:  # the error occurs when this line executes
            pass
            #temp = tb_createTopRankTable(chunk)
            #dfs.append(temp)
            #df = tb_createTopRankTable(dfs)
    except:
        traceback.print_exc(file=sys.stdout)






    ipdb> pd.__version__
    '0.16.0'

I use the following commands to monitor memory usage:

    top
    ps -C python -o %cpu,%mem,cmd

It takes about 2 seconds to crash, and in that time I can see memory usage reach 90% and CPU usage reach 100%.



When I execute python bigData.py, the following error is generated:

    /usr/local/lib/python2.7/dist-packages/pytz/__init__.py:29: UserWarning: Module dateutil was already imported from /usr/local/lib/python2.7/dist-packages/dateutil/__init__.pyc, but /usr/lib/python2.7/dist-packages is being added to sys.path
      from pkg_resources import resource_stream
    /usr/local/lib/python2.7/dist-packages/pytz/__init__.py:29: UserWarning: Module pytz was already imported from /usr/local/lib/python2.7/dist-packages/pytz/__init__.pyc, but /usr/lib/python2.7/dist-packages is being added to sys.path
      from pkg_resources import resource_stream
    Traceback (most recent call last):
      File "bigData.py", line 10, in <module>
        for chunk in reader:
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 691, in __iter__
        yield self.read(self.chunksize)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 715, in read
        ret = self._engine.read(nrows)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1164, in read
        data = self._reader.read(nrows)
      File "pandas/parser.pyx", line 758, in pandas.parser.TextReader.read (pandas/parser.c:7411)
      File "pandas/parser.pyx", line 792, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7819)
      File "pandas/parser.pyx", line 833, in pandas.parser.TextReader._read_rows (pandas/parser.c:8268)
      File "pandas/parser.pyx", line 820, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8142)
      File "pandas/parser.pyx", line 1758, in pandas.parser.raise_parser_error (pandas/parser.c:20728)
    CParserError: Error tokenizing data. C error: out of memory
    Segmentation fault (core dumped)

or

    /usr/local/lib/python2.7/dist-packages/pytz/__init__.py:29: UserWarning: Module dateutil was already imported from /usr/local/lib/python2.7/dist-packages/dateutil/__init__.pyc, but /usr/lib/python2.7/dist-packages is being added to sys.path
      from pkg_resources import resource_stream
    /usr/local/lib/python2.7/dist-packages/pytz/__init__.py:29: UserWarning: Module pytz was already imported from /usr/local/lib/python2.7/dist-packages/pytz/__init__.pyc, but /usr/lib/python2.7/dist-packages is being added to sys.path
      from pkg_resources import resource_stream
    Traceback (most recent call last):
      File "bigData.py", line 10, in <module>
        for chunk in reader:
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 691, in __iter__
        yield self.read(self.chunksize)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 715, in read
        ret = self._engine.read(nrows)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1164, in read
        data = self._reader.read(nrows)
      File "pandas/parser.pyx", line 758, in pandas.parser.TextReader.read (pandas/parser.c:7411)
      File "pandas/parser.pyx", line 792, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7819)
      File "pandas/parser.pyx", line 833, in pandas.parser.TextReader._read_rows (pandas/parser.c:8268)
      File "pandas/parser.pyx", line 820, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8142)
      File "pandas/parser.pyx", line 1758, in pandas.parser.raise_parser_error (pandas/parser.c:20728)
    CParserError: Error tokenizing data. C error: out of memory
    *** glibc detected *** python: free(): invalid pointer: 0x00007f750d2a4c0e ***
    ====== Backtrace: ========
    /lib/x86_64-linux-gnu/libc.so.6(+0x7db26)[0x7f7511529b26]
    /usr/local/lib/python2.7/dist-packages/pandas/parser.so(+0x4d5a1)[0x7f750d29d5a1]
    /usr/local/lib/python2.7/dist-packages/pandas/parser.so(parser_cleanup+0x15)[0x7f750d29de45]
    /usr/local/lib/python2.7/dist-packages/pandas/parser.so(parser_free+0x9)[0x7f750d29e039]
    /usr/local/lib/python2.7/dist-packages/pandas/parser.so(+0xb43e)[0x7f750d25b43e]
    ....
    python(PyDict_SetItem+0x49)[0x577749]
    python(_PyModule_Clear+0x149)[0x4cafb9]
    python(PyImport_Cleanup+0x477)[0x4cb4f7]
    python(Py_Finalize+0x18e)[0x549f0e]
    python(Py_Main+0x3bc)[0x56b56c]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f75114cd76d]
    python[0x41bb11]
    ======= Memory map: ========
    00400000-00670000 r-xp 00000000 08:01 26612 /usr/bin/python2.7
    0086f000-00870000 r--p 0026f000 08:01 26612 /usr/b.......
    008d9000-008eb000 rw-p 00000000 00:00 0
    01ddb000-036f7000 rw-p 00000000 00:00 0 [heap]
    7f748c179000-7f74cc17a000 rw-p 00000000 00:00 0
    7f7504000000-7f7504021000 rw-p 00000000 00:00 0
    7f7504021000-7f7508000000 ---p 00000000 00:00 0
    7f750bf83000-7f750c285000 rw-p 00000000 00:00 0
    7f750c285000-7f750c586000 rw-p 00000000 00:00 0
    7f750c586000-7f750c707000 rw-p 00000000 00:00 0
    7f750c707000-7f750c711000 r-xp 00000000 08:01 533205 /usr/local/lib/python2.7/dist-packages/pandas/_testing.so
    7f750c711000-7f750c911000 ---p 0000a000 08:01 533205 /usr/local/lib/python2.7/dist-packages/pandas/_testing.so
    7f750c911000-7f750c912000 r--p 0000a000 08:01 533205 /usr/local/lib/python2.7/dist-packages/pandas/_testing.so
    7f750c912000-7f750c913000 rw-p 0000b000 08:01 533205 /usr/local/lib/python2.7/dist-packages/pandas/_testing.so
    7f750c913000-7f750c914000 rw-p 00000000 00:00 0
    7f750c914000-7f750c918000 r-xp 00000000 08:01 2331 /lib/x86_64-linux-gnu/libuuid.so.1.3.0
    7f750c918000-7f750cb17000 ---p 00004000 08:01 2331 /lib/x86_64-linux-gnu/libuuid.so.1.3.0
    7f750cb17000-7f750cb18000 r--p 00003000 08:01 2331 /lib/x86_64-linux-gnu/libuuid.so.1.3.0
    7f750cb18000-7f750cb19000 rw-p 00004000 08:01 2331 /lib/x86_64-linux-gnu/libuuid.so.1.3.0
    7f750cb19000-7f750cb34000 r-xp 00000000 08:01 533071 /usr/local/lib/python2.7/dist-packages/pandas/msgpack.so
    7f750cb34000-7f750cd33000 ---p 0001b000 08:01 533071 /usr/local/lib/python2.7/dist-packages/pandas/msgpack.so
    7f750cd33000-7f750cd34000 r--p 0001a000 08:01 533071 /usr/local/lib/python2.7/dist-packages/pandas/msgpack.so
    7f750cd34000-7f750cd38000 rw-p 0001b000 08:01 533071 /usr/local/lib/python2.7/dist-packages/pandas/msgpack.so
    7f750cd38000-7f750d039000 rw-p 00000000 00:00 0
    7f750d039000-7f750d04e000 r-xp 00000000 08:01 533070 /usr/local/lib/python2.7/dist-packages/pandas/json.so
    7f750d04e000-7f750d24e000 ---p 00015000 08:01 533070 /usr/local/lib/python2.7/dist-packages/pandas/json.so
    7f750d24e000-7f750d24f000 r--p 00015000 08:01 533070 /usr/local/lib/python2.7/dist-packages/pandas/json.so
    7f750d24f000-7f750d250000 rw-p 00016000 08:01 533070 /usr/local/lib/python2.7/dist-packages/pandas/json.so
    7f750d250000-7f750d2a9000 r-xp 00000000 08:01 533270 /usr/local/lib/python2.7/dist-packages/pandas/parser.so
    7f750d2a9000-7f750d4a8000 ---p 00059000 08:01 533270 /usr/local/lib/python2.7/dist-packages/pandas/parser.so
    7f750d4a8000-7f750d4a9000 r--p 00058000 08:01 533270 /usr/local/lib/python2.7/dist-packages/pandas/parser.so
    7f750d4a9000-7f750d4af000 rw-p 00059000 08:01 533270 /usr/local/lib/python2.7/dist-packages/pandas/parser.so
    7f750d4af000-7f750d591000 r-xp 00000000 08:01 49584 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
    7f750d591000-7f750d790000 ---p 000e2000 08:01 49584 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
    7f750d790000-7f750d798000 r--p 000e1000 08:01 49584 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
    7f750d798000-7f750d79a000 rw-p 000e9000 08:01 49584 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
    7f750d79a000-7f750d7af000 rw-p 00000000 00:00 0
    7f750d7af000-7f750d7f1000 r-xp 00000000 08:01 530477 /usr/lib/pyshared/python2.7/matplotlib/_path.so
    7f750d7f1000-7f750d9f1000 ---p 00042000 08:01 530477 /usr/lib/pyshared/python2.7/matplotlib/_path.so
    7f750d9f1000-7f750d9f3000 r--p 00042000 08:01 530477 /usr/lib/pyshared/python2.7/matplotlib/_path.so
    7f750d9f3000-7f750d9f4000 rw-p 00044000 08:01 530477 /usr/lib/pyshared/python2.7/matplotlib/_path.so
    7f750d9f4000-7f750da2d000 r-xp 00000000 08:01 533269 /usr/local/lib/python2.7/dist-packages/pandas/_sparse.so
    7f750da2d000-7f750dc2c000 ---p 00039000 08:01 533269 /usr/local/lib/python2.7/dist-packages/pandas/_sparse.so
    7f750dc2c000-7f750dc2d000 r--p 00038000 08:01 533269 /usr/local/lib/python2.7/dist-packages/pandas/_sparse.so
    7f750dc2d000-7f750dc31000 rw-p 00039000 08:01 533269 /usr/local/lib/python2.7/dist-packages/pandas/_sparse.so
    7f750dc31000-7f750dc7d000 r-xp 00000000 08:01 533447 /usr/local/lib/python2.7/dist-packages/pandas/_period.so
    7f750dc7d000-7f750de7c000 ---p 0004c000 08:01 533447 /usr/local/lib/python2.7/dist-packages/pandas/_period.so
    7f750de7c000-7f750de7d000 r--p 0004b000 08:01 533447 /usr/local/lib/python2.7/dist-packages/pandas/_period.so
    7f750de7d000-7f750de86000 rw-p 0004c000 08:01 533447 /usr/local/lib/python2.7/dist-packages/pandas/_period.so
    7f750de86000-7f750de87000 rw-p 00000000 00:00 0
    7f750de87000-7f750debc000 r-xp 00000000 08:01 533203 /usr/local/lib/python2.7/dist-packages/pandas/index.so
    7f750debc000-7f750e0bb000 ---p 00035000 08:01 533203 /usr/local/lib/python2.7/dist-packages/pandas/index.so
    7f750e0bb000-7f750e0bc000 r--p 00034000 08:01 533203 /usr/local/lib/python2.7/dist-packages/pandas/index.so
    7f750e0bc000-7f750e0c0000 rw-p 00035000 08:01 533203 /usr/local/lib/python2.7/dist-packages/pandas/index.so
    7f750e0c0000-7f750e295000 r-xp 00000000 08:01 533278 /usr/local/lib/python2.7/dist-packages/pandas/algos.so
    7f750e295000-7f750e494000 ---p 001d5000 08:01 533278 /usr/local/lib/python2.7/dist-packages/pandas/algos.so
    7f750e494000-7f750e495000 r--p 001d4000 08:01 533278 /usr/local/lib/python2.7/dist-packages/pandas/algos.so
    7f750e495000-7f750e4a9000 rw-p 001d5000 08:01 533278 /usr/local/lib/python2.7/dist-packages/pandas/algos.so
    7f750e4a9000-7f750e4ac000 rw-p 00000000 00:00 0
    7f750e4ac000-7f750e4b2000 r-xp 00000000 08:01 48831 /usr/lib/python2.7/lib-dynload/_csv.so
    7f750e4b2000-7f750e6b1000 ---p 00006000 08:01 48831 /usr/lib/python2.7/lib-dynload/_csv.so
    7f750e6b1000-7f750e6b2000 r--p 00005000 08:01 48831 /usr/lib/python2.7/lib-dynload/_csv.so
    7f750e6b2000-7f750e6b4000 rw-p 00006000 08:01 48831 /usr/lib/python2.7/lib-dynload/_csv.so
    7f750e6b4000-7f750e782000 r-xp 00000000 08:01 533449 /usr/local/lib/python2.7/dist-packages/pandas/lib.so
    7f750e782000-7f750e981000 ---p 000ce000 08:01 533449 /usr/local/lib/python2.7/dist-packages/pandas/lib.so
    7f750e981000-7f750e982000 r--p 000cd000 08:01 533449 /usr/local/lib/python2.7/dist-packages/pandas/lib.so
    7f750e982000-7f750e990000 rw-p 000ce000 08:01 533449 /usr/local/lib/python2.7/dist-packages/pandas/lib.so
    7f750e990000-7f750e992000 rw-p 00000000 00:00 0
    7f750e992000-7f750ea8f000 r-xp 00000000 08:01 533271 /usr/local/lib/python2.7/dist-packages/pandas/tslib.so
    7f750ea8f000-7f750ec8e000 ---p 000fd000 08:01 533271 /usr/local/lib/python2.7/dist-packages/pandas/tslib.so
    7f750ec8e000-7f750ec8f000 r--p 000fc000 08:01 533271 /usr/local/lib/python2.7/dist-packages/pandas/tslib.so
    7f750ec8f000-7f750eca1000 rw-p 000fd000 08:01 533271 /usr/local/lib/python2.7/dist-packages/pandas/tslib.so
    7f750eca1000-7f750eca4000 rw-p 00000000 00:00 0
    7f750eca4000-7f750ecc5000 r-xp 00000000 08:01 48837 /usr/lib/python2.7/lib-dynload/_ctypes.so
    7f750ecc5000-7f750eec4000 ---p 00021000 08:01 48837 /usr/lib/python2.7/lib-dynload/_ctypes.so
    7f750eec4000-7f750eec5000 r--p 00020000 08:01 48837 /usr/lib/python2.7/lib-dynload/_ctypes.so
    7f750eec5000-7f750eec9000 rw-p 00021000 08:01 48837 /usr/lib/python2.7/lib-dynload/_ctypes.so
    7f750eec9000-7f750eeca000 rw-p 00000000 00:00 0
    7f750eeca000-7f750ef24000 r-xp 00000000 08:01 532046 /usr/local/lib/python2.7/dist-packages/numpy/random/mtrand.so
    7f750ef24000-7f750f123000 ---p 0005a000 08:01 532046 /usr/local/lib/python2.7/dist-packages/numpy/random/mtrand.so
    7f750f123000-7f750f124000 r--p 00059000 08:01 532046 /usr/local/lib/python2.7/dist-packages/numpy/random/mtrand.so
    7f750f124000-7f750f15c000 rw-p 0005a000 08:01 532046 /usr/local/lib/python2.7/dist-packages/numpy/random/mtrand.so
    7f750f15c000-7f750f15d000 rw-p 00000000 00:00 0
    7f750f15d000-7f750f166000 r-xp 00000000 08:01 532085 /usr/local/lib/python2.7/dist-packages/numpy/fft/fftpack_lite.so
    7f750f166000-7f750f365000 ---p 00009000 08:01 532085 /usr/local/lib/python2.7/dist-packages/numpy/fft/fftpack_lite.so
    7f750f365000-7f750f366000 r--p 00008000 08:01 532085 /usr/local/lib/python2.7/dist-packages/numpy/fft/fftpack_lite.so
    7f750f366000-7f750f367000 rw-p 00009000 08:01 532085 /usr/local/lib/python2.7/dist-packages/numpy/fft/fftpack_lite.so
    7f750f367000-7f750f368000 r-xp 00000000 08:01 48818 /usr/lib/python2.7/lib-dynload/future_builtins.so
    7f750f368000-7f750f567000 ---p 00001000 08:01 48818 /usr/lib/python2.7/lib-dynload/future_builtins.so
    7f750f567000-7f750f568000 r--p 00000000 08:01 48818 /usr/lib/python2.7/lib-dynload/future_builtins.so
    7f750f568000-7f750f569000 rw-p 00001000 08:01 48818 /usr/lib/python2.7/lib-dynload/future_builtins.so
    7f750f569000-7f750f588000 r-xp 00000000 08:01 48815 /usr/lib/python2.7/lib-dynload/_io.so
    7f750f588000-7f750f787000 ---p 0001f000 08:01 48815 /usr/lib/python2.7/lib-dynload/_io.so
    7f750f787000-7f750f788000 r--p 0001e000 08:01 48815 /usr/lib/python2.7/lib-dynload/_io.so
    7f750f788000-7f750f791000 rw-p 0001f000 08:01 48815 /usr/lib/python2.7/lib-dynload/_io.so
    7f750f791000-7f750f907000 r-xp 00000000 08:01 532132 /usr/local/lib/python2.7/dist-packages/numpy/linalg/_umath_linalg.so
    7f750f907000-7f750fb06000 ---p 00176000 08:01 532132 /usr/local/lib/python2.7/dist-packages/numpy/linalg/_umath_linalg.so
    7f750fb06000-7f750fb07000 r--p 00175000 08:01 532132 /usr/local/lib/python2.7/dist-packages/numpy/linalg/_umath_linalg.so
    7f750fb07000-7f750fb08000 rw-p 00176000 08:01 532132 /usr/local/lib/python2.7/dist-packages/numpy/linalg/_umath_linalg.so
    7f750fb08000-7f750fba4000 rw-p 00000000 00:00 0
    7f750fba4000-7f750fd01000 r-xp 00000000 08:01 532128 /usr/local/lib/python2.7/dist-packages/numpy/linalg/lapack_lite.so
    7f750fd01000-7f750ff00000 ---p 0015d000 08:01 532128 /usr/local/lib/python2.7/dist-packages/numpy/linalg/lapack_lite.so
    7f750ff00000-7f750ff01000 r--p 0015c000 08:01 532128 /usr/local/lib/python2.7/dist-packages/numpy/linalg/lapack_lite.so
    7f750ff01000-7f750ff02000 rw-p 0015d000 08:01 532128 /usr/local/lib/python2.7/dist-packages/numpy/linalg/lapack_lite.so
    7f750ff02000-7f750ff9d000 rw-p 00000000 00:00 0
    7f750ff9d000-7f750ffa3000 r-xp 00000000 08:01 532067 /usr/local/lib/python2.7/dist-packages/numpy/lib/_compiled_base.so
    7f750ffa3000-7f75101a2000 ---p 00006000 08:01 532067 /usr/local/lib/python2.7/dist-packages/numpy/lib/_compiled_base.so
    7f75101a2000-7f75101a3000 r--p 00005000 08:01 532067 /usr/local/lib/python2.7/dist-packages/numpy/lib/_compiled_base.so
    7f75101a3000-7f75101a4000 rw-p 00006000 08:01 532067 /usr/local/lib/python2.7/dist-packages/numpy/lib/_compiled_base.so
    7f7510265000-7f751028f000 r-xp 00000000 08:01 532108 /usr/local/lib/python2.7/dist-packages/numpy/core/scalarmath.so
    7f751028f000-7f751048e000 ---p 0002a000 08:01 532108 /usr/local/lib/python2.7/dist-packages/numpy/core/scalarmath.so
    7f751048e000-7f751048f000 r--p 00029000 08:01 532108 /usr/local/lib/python2.7/dist-packages/numpy/core/scalarmath.so
    7f751048f000-7f7510491000 rw-p 0002a000 08:01 532108 /usr/local/lib/python2.7/dist-packages/numpy/core/scalarmath.so
    7f7510491000-7f75104d2000 rw-p 00000000 00:00 0
    7f75104d2000-7f75104d5000 r-xp 00000000 08:01 48833 /usr/lib/python2.7/lib-dynload/_heapq.so
    7f75104d5000-7f75106d4000 ---p 00003000 08:01 48833 /usr/lib/python2.7/lib-dynload/_heapq.so
    7f75106d4000-7f75106d5000 r--p 00002000 08:01 48833 /usr/lib/python2.7/lib-dynload/_heapq.so
    7f75106d5000-7f75106d7000 rw-p 00003000 08:01 48833 /usr/lib/python2.7/lib-dynload/_heapq.so
    7f75106d7000-7f751073e000 r-xp 00000000 08:01 532118 /usr/local/lib/python2.7/dist-packages/numpy/core/umath.so
    7f751073e000-7f751093d000 ---p 00067000 08:01 532118 /usr/local/lib/python2.7/dist-packages/numpy/core/umath.so
    7f751093d000-7f751093e000 r--p 00066000 08:01 532118 /usr/local/lib/python2.7/dist-packages/numpy/core/umath.so
    7f751093e000-7f7510942000 rw-p 00067000 08:01 532118 /usr/local/lib/python2.7/dist-packages/numpy/core/umath.so
    7f7510942000-7f7510944000 rw-p 00000000 00:00 0
    7f7510944000-7f7510958000 r-xp 00000000 08:01 48804 /usr/lib/python2.7/lib-dynload/datetime.so
    7f7510958000-7f7510b57000 ---p 00014000 08:01 48804 /usr/lib/python2.7/lib-dynload/datetime.so
    7f7510b57000-7f7510b58000 r--p 00013000 08:01 48804 /usr/lib/python2.7/lib-dynload/datetime.so
    7f7510b58000-7f7510b5c000 rw-p 00014000 08:01 48804 /usr/lib/python2.7/lib-dynload/datetime.so
    7f7510b5c000-7f7510caf000 r-xp 00000000 08:01 532106 /usr/local/lib/python2.7/dist-packages/numpy/core/multiarray.so
    7f7510caf000-7f7510eae000 ---p 00153000 08:01 532106 /usr/local/lib/python2.7/dist-packages/numpy/core/multiarray.so
    7f7510eae000-7f7510eb0000 r--p 00152000 08:01 532106 /usr/local/lib/python2.7/dist-packages/numpy/core/multiarray.so
    7f7510eb0000-7f7510ebd000 rw-p 00154000 08:01 532106 /usr/local/lib/python2.7/dist-packages/numpy/core/multiarray.so
    7f7510ebd000-7f7510ecf000 rw-p 00000000 00:00 0
    7f7510ecf000-7f7510f08000 r-xp 00000000 08:01 533450 /usr/local/lib/python2.7/dist-packages/pandas/hashtable.so
    7f7510f08000-7f7511107000 ---p 00039000 08:01 533450 /usr/local/lib/python2.7/dist-packages/pandas/hashtable.so
    7f7511107000-7f7511108000 r--p 00038000 08:01 533450 /usr/local/lib/python2.7/dist-packages/pandas/hashtable.so
    7f7511108000-7f751110c000 rw-p 00039000 08:01 533450 /usr/local/lib/python2.7/dist-packages/pandas/hashtable.so
    7f751110c000-7f751110d000 rw-p 00000000 00:00 0
    7f751110d000-7f7511296000 r--p 00000000 08:01 58562 /usr/lib/locale/locale-archive
    7f7511296000-7f75112ab000 r-xp 00000000 08:01 2312 /lib/x86_64-linux-gnu/libgcc_s.so.1
    7f75112ab000-7f75114aa000 ---p 00015000 08:01 2312 /lib/x86_64-linux-gnu/libgcc_s.so.1
    7f75114aa000-7f75114ab000 r--p 00014000 08:01 2312 /lib/x86_64-linux-gnu/libgcc_s.so.1
    7f75114ab000-7f75114ac000 rw-p 00015000 08:01 2312 /lib/x86_64-linux-gnu/libgcc_s.so.1
    7f75114ac000-7f7511660000 r-xp 00000000 08:01 2327 /lib/x86_64-linux-gnu/libc-2.15.so
    7f7511660000-7f751185f000 ---p 001b4000 08:01 2327 /lib/x86_64-linux-gnu/libc-2.15.so
    7f751185f000-7f7511863000 r--p 001b3000 08:01 2327 /lib/x86_64-linux-gnu/libc-2.15.so
    7f7511863000-7f7511865000 rw-p 001b7000 08:01 2327 /lib/x86_64-linux-gnu/libc-2.15.so
    7f7511865000-7f751186a000 rw-p 00000000 00:00 0
    7f751186a000-7f7511965000 r-xp 00000000 08:01 2400 /lib/x86_64-linux-gnu/libm-2.15.so
    7f7511965000-7f7511b64000 ---p 000fb000 08:01 2400 /lib/x86_64-linux-gnu/libm-2.15.so
    7f7511b64000-7f7511b65000 r--p 000fa000 08:01 2400 /lib/x86_64-linux-gnu/libm-2.15.so
    7f7511b65000-7f7511b66000 rw-p 000fb000 08:01 2400 /lib/x86_64-linux-gnu/libm-2.15.so
    7f7511b66000-7f7511b7c000 r-xp 00000000 08:01 2288 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
    7f7511b7c000-7f7511d7b000 ---p 00016000 08:01 2288 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
    7f7511d7b000-7f7511d7c000 r--p 00015000 08:01 2288 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
    7f7511d7c000-7f7511d7d000 rw-p 00016000 08:01 2288 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
    7f7511d7d000-7f7511f2f000 r-xp 00000000 08:01 2279 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
    7f7511f2f000-7f751212e000 ---p 001b2000 08:01 2279 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
    7f751212e000-7f7512149000 r--p 001b1000 08:01 2279 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
    7f7512149000-7f7512154000 rw-p 001cc000 08:01 2279 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
    7f7512154000-7f7512158000 rw-p 00000000 00:00 0
    7f7512158000-7f75121ac000 r-xp 00000000 08:01 2393 /lib/x86_64-linux-gnu/libssl.so.1.0.0
    7f75121ac000-7f75123ac000 ---p 00054000 08:01 2393 /lib/x86_64-linux-gnu/libssl.so.1.0.0
    7f75123ac000-7f75123af000 r--p 00054000 08:01 2393 /lib/x86_64-linux-gnu/libssl.so.1.0.0
    7f75123af000-7f75123b6000 rw-p 00057000 08:01 2393 /lib/x86_64-linux-gnu/libssl.so.1.0.0
    7f75123b6000-7f75123b8000 r-xp 00000000 08:01 2283 /lib/x86_64-linux-gnu/libutil-2.15.so
    7f75123b8000-7f75125b7000 ---p 00002000 08:01 2283 /lib/x86_64-linux-gnu/libutil-2.15.so
    7f75125b7000-7f75125b8000 r--p 00001000 08:01 2283 /lib/x86_64-linux-gnu/libutil-2.15.so
    7f75125b8000-7f75125b9000 rw-p 00002000 08:01 2283 /lib/x86_64-linux-gnu/libutil-2.15.so
    7f75125b9000-7f75125bb000 r-xp 00000000 08:01 2406

    /lib/x86_64-linux-gnu/ld-2.15.so
    7f7512a2d000-7f7512b31000 rw-p 00000000 00:00 0
    7f7512b62000-7f7512bea000 rw-p 00000000 00:00 0
    7f7512bf7000-7f7512bf9000 rw-p 00000000 00:00 0
    7f7512bf9000-7f7512bfa000 rwxp 00000000 00:00 0
    7f7512bfa000-7f7512bfc000 rw-p 00000000 00:00 0
    7f7512bfc000-7f7512bfd000 r--p 00022000 08:01 2260 /lib/x86_64-linux-gnu/ld-2.15.so
    7f7512bfd000-7f7512bff000 rw-p 00023000 08:01 2260 /lib/x86_64-linux-gnu/ld-2.15.so
    7ffcf454c000-7ffcf4585000 rw-p 00000000 00:00 0 [stack]
    7ffcf459b000-7ffcf459d000 r-xp 00000000 00:00 0 [vdso]
    ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
    Aborted (core dumped)

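With the code below there is no memory problem, but it does nothing useful yet; I still need to do the grouping and aggregation:

    with open("data/petaJoined.csv", "r") as content:
        for line in content:
            #print line
            pass
            #do stuff with line
    # no explicit content.close() is needed; the with block closes the file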

Does anyone know what is going on?



This is what I actually want to achieve. Is there a possible solution?

Note that I already read the csv in chunks, but I still get the memory error.



Then I changed the chunk size in the bigData.py file:

    import pandas as pd
    import numpy as np
    import sys,traceback,os
    import etl2  # my own processing flow
    reload(etl2)
    def iter_chunks(n, df):
        while True:
            try:
                yield df.get_chunk(n)
            except StopIteration:
                break
    cksize = 5
    try:
        dfs = pd.DataFrame()
        reader = pd.read_table('data/petaJoined.csv',
                               chunksize=cksize,
                               low_memory=False,
                               iterator=True
                               )  # choose as appropriate
        for chunk in iter_chunks(cksize, reader):
            temp = etl2.tb_createTopRankTable(chunk)
            dfs.append(temp)
        df = tb_createTopRankTable(dfs)

        #for chunk in reader:
            #pass
            #temp = tb_createTopRankTable(chunk)
            #dfs.append(temp)
            #df = tb_createTopRankTable(dfs)
    except:
        traceback.print_exc(file=sys.stdout)

It still hits a segmentation fault after running for a while.

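    def tb_createTopRankTable(df):
        try:
            key = 'name1'
            key2 = 'name2'
            df2 = df.groupby([key, key2])['isError'].agg({'errorNum': 'sum', 'totalParcel': 'count'})
            df2['errorRate'] = df2['errorNum'] / df2['totalParcel']
            return df2
        except:
            # the handler is cut off in the post; assumed to mirror the script above
            traceback.print_exc(file=sys.stdout)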
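A note on combining chunk results: an error rate computed per chunk cannot simply be re-aggregated, so one option is to keep per-chunk sums and counts and only divide at the end. The sketch below is an illustration under assumptions, not a fix for the parser crash itself: it assumes tab-separated input with no header row, borrows the column names from the table above, uses a tiny in-memory sample in place of data/petaJoined.csv, and the helper name `aggregate_in_chunks` is hypothetical.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample mirroring the six rows in the question.
SAMPLE = (
    "3M\t2345\t2345\t2014-10-5\t3000\n"
    "3M\t2958\t2152\t2015-3-22\t5000\n"
    "GE\t2183\t2183\t2012-12-31\t515\n"
    "3M\t2958\t2958\t2015-3-10\t395\n"
    "GE\t2183\t2285\t2015-4-19\t1925\n"
    "GE\t2598\t2598\t2015-3-17\t1915\n"
)
COLUMNS = ["ka", "kb_1", "kb_2", "timeofEvent", "timeInterval"]


def aggregate_in_chunks(source, chunksize):
    """Group by (ka, kb_1) chunk by chunk, then merge the partial results."""
    partials = []
    reader = pd.read_csv(source, sep="\t", names=COLUMNS, dtype=str,
                         chunksize=chunksize)
    for chunk in reader:
        # A record counts as an error when kb_1 differs from kb_2.
        is_error = (chunk["kb_1"] != chunk["kb_2"]).astype(int)
        part = (chunk.assign(isError=is_error)
                     .groupby(["ka", "kb_1"])["isError"]
                     .agg(["sum", "count"]))
        partials.append(part)
    # Sums and counts are additive across chunks; rates are not.
    total = pd.concat(partials).groupby(level=[0, 1]).sum()
    total.columns = ["errorNum", "totalNum"]
    total["errorRate"] = total["errorNum"] / total["totalNum"]
    return total


result = aggregate_in_chunks(io.StringIO(SAMPLE), chunksize=2)
print(result)
```

The list of partial results grows with the number of chunks, so for a very large file it may be worth folding each partial into a running total instead of concatenating everything at the end.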


Solution

Based on your snippet, which reads the file line by line, and assuming that kb_2 is the error indicator:

    groups = {}
    with open("data/petaJoined.csv", "r") as large_file:
        for line in large_file:
            arr = line.split('\t')
            # assuming this structure: ka,kb_1,kb_2,timeofEvent,timeInterval
            k = arr[0] + ',' + arr[1]
            if k not in groups:
                groups[k] = {'record_count': 0, 'error_sum': 0}
            groups[k]['record_count'] = groups[k]['record_count'] + 1
            groups[k]['error_sum'] = groups[k]['error_sum'] + float(arr[2])
    for k, v in groups.items():
        print('{group}: {error_rate}'.format(group=k, error_rate=v['error_sum'] / v['record_count']))

This code snippet stores all the groups in a dictionary and calculates the error rate after reading the entire file.

It will run out of memory if there are too many distinct group combinations.
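The same single-pass idea can be adapted to the question's own definition of an error (a record where kb_1 != kb_2). The following is a minimal sketch under assumptions: the file is tab-separated with the columns in the order ka, kb_1, kb_2, timeofEvent, timeInterval, the function name `error_stats` is made up for illustration, and a small in-memory sample stands in for data/petaJoined.csv.

```python
import csv
import io
from collections import defaultdict

# Hypothetical tab-separated sample mirroring the six rows in the question.
SAMPLE = (
    "3M\t2345\t2345\t2014-10-5\t3000\n"
    "3M\t2958\t2152\t2015-3-22\t5000\n"
    "GE\t2183\t2183\t2012-12-31\t515\n"
    "3M\t2958\t2958\t2015-3-10\t395\n"
    "GE\t2183\t2285\t2015-4-19\t1925\n"
    "GE\t2598\t2598\t2015-3-17\t1915\n"
)


def error_stats(lines, delimiter="\t"):
    """Return {(ka, kb_1): (errorNum, totalNum, errorRate)} in one pass."""
    counts = defaultdict(lambda: [0, 0])  # (ka, kb_1) -> [errorNum, totalNum]
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        ka, kb_1, kb_2 = row[0], row[1], row[2]
        entry = counts[(ka, kb_1)]
        entry[0] += int(kb_1 != kb_2)  # an error record has kb_1 != kb_2
        entry[1] += 1
    return {key: (err, total, err / total)
            for key, (err, total) in counts.items()}


stats = error_stats(io.StringIO(SAMPLE))
for (ka, kb), (err, total, rate) in sorted(stats.items()):
    print("{0},{1},{2},{3:.0%},{4}".format(ka, kb, err, rate, total))
```

For the real file, an open file object (for example `open("data/petaJoined.csv", newline="")`) could be passed in place of the `io.StringIO` sample. Only one small list per group is held in memory, so memory use depends on the number of distinct (ka, kb_1) pairs rather than on the file size.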


I am processing a 2.5 GB csv file. The table looks like this:

columns=[ka,kb_1,kb_2,timeofEvent,timeInterval]
0:'3M' '2345' '2345' '2014-10-5',3000
1:'3M' '2958' '2152' '2015-3-22',5000
2:'GE' '2183' '2183' '2012-12-31',515
3:'3M' '2958' '2958' '2015-3-10',395
4:'GE' '2183' '2285' '2015-4-19',1925
5:'GE' '2598' '2598' '2015-3-17',1915

I want to group by ka and kb_1 to get a result like this:

columns=[ka,kb,errorNum,errorRate,totalNum of records]
'3M','2345',0,0%,1
'3M','2958',1,50%,2
'GE','2183',1,50%,2
'GE','2598',0,0%,1

(Definition of an error record: when kb_1 != kb_2, the corresponding record is treated as abnormal.)

My computer, which runs Ubuntu 12.04, has 16 GB of memory, and free -m returns:

             total       used       free     shared    buffers     cached
Mem:        112809      14476      98333          0        128      10823
-/+ buffers/cache:       3524     109285
Swap:            0          0          0

My python file is called bigData.py

import pandas as pd
import numpy as np

import sys,traceback,os
cksize=98333 # or 1024, either chunk size didn't work at all
try:
    dfs = pd.DataFrame()
    reader=pd.read_table('data/petaJoined.csv', chunksize=cksize)  

    for chunk in reader:  # the error occurs when this line executes
        pass
        #temp=tb_createTopRankTable(chunk)
        #dfs.append(temp)
        #df=tb_createTopRankTable(dfs)
except:
    traceback.print_exc(file=sys.stdout)


ipdb> pd.__version__
'0.16.0'

I use the following command to monitor the memory usage:

top 
ps -C python -o %cpu,%mem,cmd

It takes about 2 seconds to crash, and in that time I can see memory usage reach 90% and CPU usage reach 100%.

When I execute python bigData.py, the following error is generated:

/usr/local/lib/python2.7/dist-packages/pytz/__init__.py:29: UserWarning: Module dateutil was already imported from /usr/local/lib/python2.7/dist-packages/dateutil/__init__.pyc, but /usr/lib/python2.7/dist-packages is being added to sys.path
  from pkg_resources import resource_stream
/usr/local/lib/python2.7/dist-packages/pytz/__init__.py:29: UserWarning: Module pytz was already imported from /usr/local/lib/python2.7/dist-packages/pytz/__init__.pyc, but /usr/lib/python2.7/dist-packages is being added to sys.path
  from pkg_resources import resource_stream
Traceback (most recent call last):
  File "bigData.py", line 10, in <module>
    for chunk in reader:
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 691, in __iter__
    yield self.read(self.chunksize)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 715, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1164, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 758, in pandas.parser.TextReader.read (pandas/parser.c:7411)
  File "pandas/parser.pyx", line 792, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7819)
  File "pandas/parser.pyx", line 833, in pandas.parser.TextReader._read_rows (pandas/parser.c:8268)
  File "pandas/parser.pyx", line 820, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8142)
  File "pandas/parser.pyx", line 1758, in pandas.parser.raise_parser_error (pandas/parser.c:20728)
CParserError: Error tokenizing data. C error: out of memory
Segmentation fault (core dumped)

or

     /usr/local/lib/python2.7/dist-packages/pytz/__init__.py:29: UserWarning: Module dateutil was already imported from /usr/local/lib/python2.7/dist-packages/dateutil/__init__.pyc, but /usr/lib/python2.7/dist-packages is being added to sys.path
      from pkg_resources import resource_stream
    /usr/local/lib/python2.7/dist-packages/pytz/__init__.py:29: UserWarning: Module pytz was already imported from /usr/local/lib/python2.7/dist-packages/pytz/__init__.pyc, but /usr/lib/python2.7/dist-packages is being added to sys.path
      from pkg_resources import resource_stream
    Traceback (most recent call last):
      File "bigData.py", line 10, in <module>
        for chunk in reader:
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 691, in __iter__
        yield self.read(self.chunksize)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 715, in read
        ret = self._engine.read(nrows)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1164, in read
        data = self._reader.read(nrows)
      File "pandas/parser.pyx", line 758, in pandas.parser.TextReader.read (pandas/parser.c:7411)
      File "pandas/parser.pyx", line 792, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7819)
      File "pandas/parser.pyx", line 833, in pandas.parser.TextReader._read_rows (pandas/parser.c:8268)
      File "pandas/parser.pyx", line 820, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8142)
      File "pandas/parser.pyx", line 1758, in pandas.parser.raise_parser_error (pandas/parser.c:20728)
    CParserError: Error tokenizing data. C error: out of memory
    *** glibc detected *** python: free(): invalid pointer: 0x00007f750d2a4c0e ***
    ====== Backtrace: ========
    /lib/x86_64-linux-gnu/libc.so.6(+0x7db26)[0x7f7511529b26]
    /usr/local/lib/python2.7/dist-packages/pandas/parser.so(+0x4d5a1)[0x7f750d29d5a1]
    /usr/local/lib/python2.7/dist-packages/pandas/parser.so(parser_cleanup+0x15)[0x7f750d29de45]
    /usr/local/lib/python2.7/dist-packages/pandas/parser.so(parser_free+0x9)[0x7f750d29e039]
    /usr/local/lib/python2.7/dist-packages/pandas/parser.so(+0xb43e)[0x7f750d25b43e]
   ....
    python(PyDict_SetItem+0x49)[0x577749]
    python(_PyModule_Clear+0x149)[0x4cafb9]
    python(PyImport_Cleanup+0x477)[0x4cb4f7]
    python(Py_Finalize+0x18e)[0x549f0e]
    python(Py_Main+0x3bc)[0x56b56c]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f75114cd76d]
    python[0x41bb11]
    ======= Memory map: ========
    00400000-00670000 r-xp 00000000 08:01 26612                              /usr/bin/python2.7
    0086f000-00870000 r--p 0026f000 08:01 26612                              /usr/b.......
    008d9000-008eb000 rw-p 00000000 00:00 0 
    01ddb000-036f7000 rw-p 00000000 00:00 0                                  [heap]
    7f748c179000-7f74cc17a000 rw-p 00000000 00:00 0 
    7f7504000000-7f7504021000 rw-p 00000000 00:00 0 
    7f7504021000-7f7508000000 ---p 00000000 00:00 0 
    7f750bf83000-7f750c285000 rw-p 00000000 00:00 0 
    7f750c285000-7f750c586000 rw-p 00000000 00:00 0 
    7f750c586000-7f750c707000 rw-p 00000000 00:00 0 
    7f750c707000-7f750c711000 r-xp 00000000 08:01 533205                     /usr/local/lib/python2.7/dist-packages/pandas/_testing.so
    7f750c711000-7f750c911000 ---p 0000a000 08:01 533205                     /usr/local/lib/python2.7/dist-packages/pandas/_testing.so
    7f750c911000-7f750c912000 r--p 0000a000 08:01 533205                     /usr/local/lib/python2.7/dist-packages/pandas/_testing.so
    7f750c912000-7f750c913000 rw-p 0000b000 08:01 533205                     /usr/local/lib/python2.7/dist-packages/pandas/_testing.so
    7f750c913000-7f750c914000 rw-p 00000000 00:00 0 
    7f750c914000-7f750c918000 r-xp 00000000 08:01 2331                       /lib/x86_64-linux-gnu/libuuid.so.1.3.0
    7f750c918000-7f750cb17000 ---p 00004000 08:01 2331                       /lib/x86_64-linux-gnu/libuuid.so.1.3.0
    7f750cb17000-7f750cb18000 r--p 00003000 08:01 2331                       /lib/x86_64-linux-gnu/libuuid.so.1.3.0
    7f750cb18000-7f750cb19000 rw-p 00004000 08:01 2331                       /lib/x86_64-linux-gnu/libuuid.so.1.3.0
    7f750cb19000-7f750cb34000 r-xp 00000000 08:01 533071                     /usr/local/lib/python2.7/dist-packages/pandas/msgpack.so
    7f750cb34000-7f750cd33000 ---p 0001b000 08:01 533071                     /usr/local/lib/python2.7/dist-packages/pandas/msgpack.so
    7f750cd33000-7f750cd34000 r--p 0001a000 08:01 533071                     /usr/local/lib/python2.7/dist-packages/pandas/msgpack.so
    7f750cd34000-7f750cd38000 rw-p 0001b000 08:01 533071                     /usr/local/lib/python2.7/dist-packages/pandas/msgpack.so
    7f750cd38000-7f750d039000 rw-p 00000000 00:00 0 
    7f750d039000-7f750d04e000 r-xp 00000000 08:01 533070                     /usr/local/lib/python2.7/dist-packages/pandas/json.so
    7f750d04e000-7f750d24e000 ---p 00015000 08:01 533070                     /usr/local/lib/python2.7/dist-packages/pandas/json.so
    7f750d24e000-7f750d24f000 r--p 00015000 08:01 533070                     /usr/local/lib/python2.7/dist-packages/pandas/json.so
    7f750d24f000-7f750d250000 rw-p 00016000 08:01 533070                     /usr/local/lib/python2.7/dist-packages/pandas/json.so
    7f750d250000-7f750d2a9000 r-xp 00000000 08:01 533270                     /usr/local/lib/python2.7/dist-packages/pandas/parser.so
    7f750d2a9000-7f750d4a8000 ---p 00059000 08:01 533270                     /usr/local/lib/python2.7/dist-packages/pandas/parser.so
    7f750d4a8000-7f750d4a9000 r--p 00058000 08:01 533270                     /usr/local/lib/python2.7/dist-packages/pandas/parser.so
    7f750d4a9000-7f750d4af000 rw-p 00059000 08:01 533270                     /usr/local/lib/python2.7/dist-packages/pandas/parser.so
    7f750d4af000-7f750d591000 r-xp 00000000 08:01 49584                      /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
    7f750d591000-7f750d790000 ---p 000e2000 08:01 49584                      /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
    7f750d790000-7f750d798000 r--p 000e1000 08:01 49584                      /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
    7f750d798000-7f750d79a000 rw-p 000e9000 08:01 49584                      /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
    7f750d79a000-7f750d7af000 rw-p 00000000 00:00 0 
    7f750d7af000-7f750d7f1000 r-xp 00000000 08:01 530477                     /usr/lib/pyshared/python2.7/matplotlib/_path.so
    7f750d7f1000-7f750d9f1000 ---p 00042000 08:01 530477                     /usr/lib/pyshared/python2.7/matplotlib/_path.so
    7f750d9f1000-7f750d9f3000 r--p 00042000 08:01 530477                     /usr/lib/pyshared/python2.7/matplotlib/_path.so
    7f750d9f3000-7f750d9f4000 rw-p 00044000 08:01 530477                     /usr/lib/pyshared/python2.7/matplotlib/_path.so
    7f750d9f4000-7f750da2d000 r-xp 00000000 08:01 533269                     /usr/local/lib/python2.7/dist-packages/pandas/_sparse.so
    7f750da2d000-7f750dc2c000 ---p 00039000 08:01 533269                     /usr/local/lib/python2.7/dist-packages/pandas/_sparse.so
    7f750dc2c000-7f750dc2d000 r--p 00038000 08:01 533269                     /usr/local/lib/python2.7/dist-packages/pandas/_sparse.so
    7f750dc2d000-7f750dc31000 rw-p 00039000 08:01 533269                     /usr/local/lib/python2.7/dist-packages/pandas/_sparse.so
    7f750dc31000-7f750dc7d000 r-xp 00000000 08:01 533447                     /usr/local/lib/python2.7/dist-packages/pandas/_period.so
    7f750dc7d000-7f750de7c000 ---p 0004c000 08:01 533447                     /usr/local/lib/python2.7/dist-packages/pandas/_period.so
    7f750de7c000-7f750de7d000 r--p 0004b000 08:01 533447                     /usr/local/lib/python2.7/dist-packages/pandas/_period.so
    7f750de7d000-7f750de86000 rw-p 0004c000 08:01 533447                     /usr/local/lib/python2.7/dist-packages/pandas/_period.so
    7f750de86000-7f750de87000 rw-p 00000000 00:00 0 
    7f750de87000-7f750debc000 r-xp 00000000 08:01 533203                     /usr/local/lib/python2.7/dist-packages/pandas/index.so
    7f750debc000-7f750e0bb000 ---p 00035000 08:01 533203                     /usr/local/lib/python2.7/dist-packages/pandas/index.so
    7f750e0bb000-7f750e0bc000 r--p 00034000 08:01 533203                     /usr/local/lib/python2.7/dist-packages/pandas/index.so
    7f750e0bc000-7f750e0c0000 rw-p 00035000 08:01 533203                     /usr/local/lib/python2.7/dist-packages/pandas/index.so
    7f750e0c0000-7f750e295000 r-xp 00000000 08:01 533278                     /usr/local/lib/python2.7/dist-packages/pandas/algos.so
    7f750e295000-7f750e494000 ---p 001d5000 08:01 533278                     /usr/local/lib/python2.7/dist-packages/pandas/algos.so
    7f750e494000-7f750e495000 r--p 001d4000 08:01 533278                     /usr/local/lib/python2.7/dist-packages/pandas/algos.so
    7f750e495000-7f750e4a9000 rw-p 001d5000 08:01 533278                     /usr/local/lib/python2.7/dist-packages/pandas/algos.so
    7f750e4a9000-7f750e4ac000 rw-p 00000000 00:00 0 
    7f750e4ac000-7f750e4b2000 r-xp 00000000 08:01 48831                      /usr/lib/python2.7/lib-dynload/_csv.so
    7f750e4b2000-7f750e6b1000 ---p 00006000 08:01 48831                      /usr/lib/python2.7/lib-dynload/_csv.so
    7f750e6b1000-7f750e6b2000 r--p 00005000 08:01 48831                      /usr/lib/python2.7/lib-dynload/_csv.so
    7f750e6b2000-7f750e6b4000 rw-p 00006000 08:01 48831                      /usr/lib/python2.7/lib-dynload/_csv.so
    7f750e6b4000-7f750e782000 r-xp 00000000 08:01 533449                     /usr/local/lib/python2.7/dist-packages/pandas/lib.so
    7f750e782000-7f750e981000 ---p 000ce000 08:01 533449                     /usr/local/lib/python2.7/dist-packages/pandas/lib.so
    7f750e981000-7f750e982000 r--p 000cd000 08:01 533449                     /usr/local/lib/python2.7/dist-packages/pandas/lib.so
    7f750e982000-7f750e990000 rw-p 000ce000 08:01 533449                     /usr/local/lib/python2.7/dist-packages/pandas/lib.so
    7f750e990000-7f750e992000 rw-p 00000000 00:00 0 
    7f750e992000-7f750ea8f000 r-xp 00000000 08:01 533271                     /usr/local/lib/python2.7/dist-packages/pandas/tslib.so
    7f750ea8f000-7f750ec8e000 ---p 000fd000 08:01 533271                     /usr/local/lib/python2.7/dist-packages/pandas/tslib.so
    7f750ec8e000-7f750ec8f000 r--p 000fc000 08:01 533271                     /usr/local/lib/python2.7/dist-packages/pandas/tslib.so
    7f750ec8f000-7f750eca1000 rw-p 000fd000 08:01 533271                     /usr/local/lib/python2.7/dist-packages/pandas/tslib.so
    7f750eca1000-7f750eca4000 rw-p 00000000 00:00 0 
    7f750eca4000-7f750ecc5000 r-xp 00000000 08:01 48837                      /usr/lib/python2.7/lib-dynload/_ctypes.so
    7f750ecc5000-7f750eec4000 ---p 00021000 08:01 48837                      /usr/lib/python2.7/lib-dynload/_ctypes.so
    7f750eec4000-7f750eec5000 r--p 00020000 08:01 48837                      /usr/lib/python2.7/lib-dynload/_ctypes.so
    7f750eec5000-7f750eec9000 rw-p 00021000 08:01 48837                      /usr/lib/python2.7/lib-dynload/_ctypes.so
    7f750eec9000-7f750eeca000 rw-p 00000000 00:00 0 
    7f750eeca000-7f750ef24000 r-xp 00000000 08:01 532046                     /usr/local/lib/python2.7/dist-packages/numpy/random/mtrand.so
    7f750ef24000-7f750f123000 ---p 0005a000 08:01 532046                     /usr/local/lib/python2.7/dist-packages/numpy/random/mtrand.so
    7f750f123000-7f750f124000 r--p 00059000 08:01 532046                     /usr/local/lib/python2.7/dist-packages/numpy/random/mtrand.so
    7f750f124000-7f750f15c000 rw-p 0005a000 08:01 532046                     /usr/local/lib/python2.7/dist-packages/numpy/random/mtrand.so
    7f750f15c000-7f750f15d000 rw-p 00000000 00:00 0 
    7f750f15d000-7f750f166000 r-xp 00000000 08:01 532085                     /usr/local/lib/python2.7/dist-packages/numpy/fft/fftpack_lite.so
    7f750f166000-7f750f365000 ---p 00009000 08:01 532085                     /usr/local/lib/python2.7/dist-packages/numpy/fft/fftpack_lite.so
    7f750f365000-7f750f366000 r--p 00008000 08:01 532085                     /usr/local/lib/python2.7/dist-packages/numpy/fft/fftpack_lite.so
    7f750f366000-7f750f367000 rw-p 00009000 08:01 532085                     /usr/local/lib/python2.7/dist-packages/numpy/fft/fftpack_lite.so
    7f750f367000-7f750f368000 r-xp 00000000 08:01 48818                      /usr/lib/python2.7/lib-dynload/future_builtins.so
    7f750f368000-7f750f567000 ---p 00001000 08:01 48818                      /usr/lib/python2.7/lib-dynload/future_builtins.so
    7f750f567000-7f750f568000 r--p 00000000 08:01 48818                      /usr/lib/python2.7/lib-dynload/future_builtins.so
    7f750f568000-7f750f569000 rw-p 00001000 08:01 48818                      /usr/lib/python2.7/lib-dynload/future_builtins.so
    7f750f569000-7f750f588000 r-xp 00000000 08:01 48815                      /usr/lib/python2.7/lib-dynload/_io.so
    7f750f588000-7f750f787000 ---p 0001f000 08:01 48815                      /usr/lib/python2.7/lib-dynload/_io.so
    7f750f787000-7f750f788000 r--p 0001e000 08:01 48815                      /usr/lib/python2.7/lib-dynload/_io.so
    7f750f788000-7f750f791000 rw-p 0001f000 08:01 48815                      /usr/lib/python2.7/lib-dynload/_io.so
    7f750f791000-7f750f907000 r-xp 00000000 08:01 532132                     /usr/local/lib/python2.7/dist-packages/numpy/linalg/_umath_linalg.so
    7f750f907000-7f750fb06000 ---p 00176000 08:01 532132                     /usr/local/lib/python2.7/dist-packages/numpy/linalg/_umath_linalg.so
    7f750fb06000-7f750fb07000 r--p 00175000 08:01 532132                     /usr/local/lib/python2.7/dist-packages/numpy/linalg/_umath_linalg.so
    7f750fb07000-7f750fb08000 rw-p 00176000 08:01 532132                     /usr/local/lib/python2.7/dist-packages/numpy/linalg/_umath_linalg.so
    7f750fb08000-7f750fba4000 rw-p 00000000 00:00 0 
    7f750fba4000-7f750fd01000 r-xp 00000000 08:01 532128                     /usr/local/lib/python2.7/dist-packages/numpy/linalg/lapack_lite.so
    7f750fd01000-7f750ff00000 ---p 0015d000 08:01 532128                     /usr/local/lib/python2.7/dist-packages/numpy/linalg/lapack_lite.so
    7f750ff00000-7f750ff01000 r--p 0015c000 08:01 532128                     /usr/local/lib/python2.7/dist-packages/numpy/linalg/lapack_lite.so
    7f750ff01000-7f750ff02000 rw-p 0015d000 08:01 532128                     /usr/local/lib/python2.7/dist-packages/numpy/linalg/lapack_lite.so
    7f750ff02000-7f750ff9d000 rw-p 00000000 00:00 0 
    7f750ff9d000-7f750ffa3000 r-xp 00000000 08:01 532067                     /usr/local/lib/python2.7/dist-packages/numpy/lib/_compiled_base.so
    7f750ffa3000-7f75101a2000 ---p 00006000 08:01 532067                     /usr/local/lib/python2.7/dist-packages/numpy/lib/_compiled_base.so
    7f75101a2000-7f75101a3000 r--p 00005000 08:01 532067                     /usr/local/lib/python2.7/dist-packages/numpy/lib/_compiled_base.so
    7f75101a3000-7f75101a4000 rw-p 00006000 08:01 532067                     /usr/local/lib/python2.7/dist-packages/numpy/lib/_compiled_base.so
    7f7510265000-7f751028f000 r-xp 00000000 08:01 532108                     /usr/local/lib/python2.7/dist-packages/numpy/core/scalarmath.so
    7f751028f000-7f751048e000 ---p 0002a000 08:01 532108                     /usr/local/lib/python2.7/dist-packages/numpy/core/scalarmath.so
    7f751048e000-7f751048f000 r--p 00029000 08:01 532108                     /usr/local/lib/python2.7/dist-packages/numpy/core/scalarmath.so
    7f751048f000-7f7510491000 rw-p 0002a000 08:01 532108                     /usr/local/lib/python2.7/dist-packages/numpy/core/scalarmath.so
    7f7510491000-7f75104d2000 rw-p 00000000 00:00 0 
    7f75104d2000-7f75104d5000 r-xp 00000000 08:01 48833                      /usr/lib/python2.7/lib-dynload/_heapq.so
    7f75104d5000-7f75106d4000 ---p 00003000 08:01 48833                      /usr/lib/python2.7/lib-dynload/_heapq.so
    7f75106d4000-7f75106d5000 r--p 00002000 08:01 48833                      /usr/lib/python2.7/lib-dynload/_heapq.so
    7f75106d5000-7f75106d7000 rw-p 00003000 08:01 48833                      /usr/lib/python2.7/lib-dynload/_heapq.so
    7f75106d7000-7f751073e000 r-xp 00000000 08:01 532118                     /usr/local/lib/python2.7/dist-packages/numpy/core/umath.so
    7f751073e000-7f751093d000 ---p 00067000 08:01 532118                     /usr/local/lib/python2.7/dist-packages/numpy/core/umath.so
    7f751093d000-7f751093e000 r--p 00066000 08:01 532118                     /usr/local/lib/python2.7/dist-packages/numpy/core/umath.so
    7f751093e000-7f7510942000 rw-p 00067000 08:01 532118                     /usr/local/lib/python2.7/dist-packages/numpy/core/umath.so
    7f7510942000-7f7510944000 rw-p 00000000 00:00 0 
    7f7510944000-7f7510958000 r-xp 00000000 08:01 48804                      /usr/lib/python2.7/lib-dynload/datetime.so
    7f7510958000-7f7510b57000 ---p 00014000 08:01 48804                      /usr/lib/python2.7/lib-dynload/datetime.so
    7f7510b57000-7f7510b58000 r--p 00013000 08:01 48804                      /usr/lib/python2.7/lib-dynload/datetime.so
    7f7510b58000-7f7510b5c000 rw-p 00014000 08:01 48804                      /usr/lib/python2.7/lib-dynload/datetime.so
    7f7510b5c000-7f7510caf000 r-xp 00000000 08:01 532106                     /usr/local/lib/python2.7/dist-packages/numpy/core/multiarray.so
    7f7510caf000-7f7510eae000 ---p 00153000 08:01 532106                     /usr/local/lib/python2.7/dist-packages/numpy/core/multiarray.so
    7f7510eae000-7f7510eb0000 r--p 00152000 08:01 532106                     /usr/local/lib/python2.7/dist-packages/numpy/core/multiarray.so
    7f7510eb0000-7f7510ebd000 rw-p 00154000 08:01 532106                     /usr/local/lib/python2.7/dist-packages/numpy/core/multiarray.so
    7f7510ebd000-7f7510ecf000 rw-p 00000000 00:00 0 
    7f7510ecf000-7f7510f08000 r-xp 00000000 08:01 533450                     /usr/local/lib/python2.7/dist-packages/pandas/hashtable.so
    7f7510f08000-7f7511107000 ---p 00039000 08:01 533450                     /usr/local/lib/python2.7/dist-packages/pandas/hashtable.so
    7f7511107000-7f7511108000 r--p 00038000 08:01 533450                     /usr/local/lib/python2.7/dist-packages/pandas/hashtable.so
    7f7511108000-7f751110c000 rw-p 00039000 08:01 533450                     /usr/local/lib/python2.7/dist-packages/pandas/hashtable.so
    7f751110c000-7f751110d000 rw-p 00000000 00:00 0 
    7f751110d000-7f7511296000 r--p 00000000 08:01 58562                      /usr/lib/locale/locale-archive
    7f7511296000-7f75112ab000 r-xp 00000000 08:01 2312                       /lib/x86_64-linux-gnu/libgcc_s.so.1
    7f75112ab000-7f75114aa000 ---p 00015000 08:01 2312                       /lib/x86_64-linux-gnu/libgcc_s.so.1
    7f75114aa000-7f75114ab000 r--p 00014000 08:01 2312                       /lib/x86_64-linux-gnu/libgcc_s.so.1
    7f75114ab000-7f75114ac000 rw-p 00015000 08:01 2312                       /lib/x86_64-linux-gnu/libgcc_s.so.1
    7f75114ac000-7f7511660000 r-xp 00000000 08:01 2327                       /lib/x86_64-linux-gnu/libc-2.15.so
    7f7511660000-7f751185f000 ---p 001b4000 08:01 2327                       /lib/x86_64-linux-gnu/libc-2.15.so
    7f751185f000-7f7511863000 r--p 001b3000 08:01 2327                       /lib/x86_64-linux-gnu/libc-2.15.so
    7f7511863000-7f7511865000 rw-p 001b7000 08:01 2327                       /lib/x86_64-linux-gnu/libc-2.15.so
    7f7511865000-7f751186a000 rw-p 00000000 00:00 0 
    7f751186a000-7f7511965000 r-xp 00000000 08:01 2400                       /lib/x86_64-linux-gnu/libm-2.15.so
    7f7511965000-7f7511b64000 ---p 000fb000 08:01 2400                       /lib/x86_64-linux-gnu/libm-2.15.so
    7f7511b64000-7f7511b65000 r--p 000fa000 08:01 2400                       /lib/x86_64-linux-gnu/libm-2.15.so
    7f7511b65000-7f7511b66000 rw-p 000fb000 08:01 2400                       /lib/x86_64-linux-gnu/libm-2.15.so
    7f7511b66000-7f7511b7c000 r-xp 00000000 08:01 2288                       /lib/x86_64-linux-gnu/libz.so.1.2.3.4
    7f7511b7c000-7f7511d7b000 ---p 00016000 08:01 2288                       /lib/x86_64-linux-gnu/libz.so.1.2.3.4
    7f7511d7b000-7f7511d7c000 r--p 00015000 08:01 2288                       /lib/x86_64-linux-gnu/libz.so.1.2.3.4
    7f7511d7c000-7f7511d7d000 rw-p 00016000 08:01 2288                       /lib/x86_64-linux-gnu/libz.so.1.2.3.4
    7f7511d7d000-7f7511f2f000 r-xp 00000000 08:01 2279                       /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
    7f7511f2f000-7f751212e000 ---p 001b2000 08:01 2279                       /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
    7f751212e000-7f7512149000 r--p 001b1000 08:01 2279                       /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
    7f7512149000-7f7512154000 rw-p 001cc000 08:01 2279                       /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
    7f7512154000-7f7512158000 rw-p 00000000 00:00 0 
    7f7512158000-7f75121ac000 r-xp 00000000 08:01 2393                       /lib/x86_64-linux-gnu/libssl.so.1.0.0
    7f75121ac000-7f75123ac000 ---p 00054000 08:01 2393                       /lib/x86_64-linux-gnu/libssl.so.1.0.0
    7f75123ac000-7f75123af000 r--p 00054000 08:01 2393                       /lib/x86_64-linux-gnu/libssl.so.1.0.0
    7f75123af000-7f75123b6000 rw-p 00057000 08:01 2393                       /lib/x86_64-linux-gnu/libssl.so.1.0.0
    7f75123b6000-7f75123b8000 r-xp 00000000 08:01 2283                       /lib/x86_64-linux-gnu/libutil-2.15.so
    7f75123b8000-7f75125b7000 ---p 00002000 08:01 2283                       /lib/x86_64-linux-gnu/libutil-2.15.so
    7f75125b7000-7f75125b8000 r--p 00001000 08:01 2283                       /lib/x86_64-linux-gnu/libutil-2.15.so
    7f75125b8000-7f75125b9000 rw-p 00002000 08:01 2283                       /lib/x86_64-linux-gnu/libutil-2.15.so
    7f75125b9000-7f75125bb000 r-xp 00000000 08:01 2406                                            

/lib/x86_64-linux-gnu/ld-2.15.so
    7f7512a2d000-7f7512b31000 rw-p 00000000 00:00 0 
    7f7512b62000-7f7512bea000 rw-p 00000000 00:00 0 
    7f7512bf7000-7f7512bf9000 rw-p 00000000 00:00 0 
    7f7512bf9000-7f7512bfa000 rwxp 00000000 00:00 0 
    7f7512bfa000-7f7512bfc000 rw-p 00000000 00:00 0 
    7f7512bfc000-7f7512bfd000 r--p 00022000 08:01 2260                       /lib/x86_64-linux-gnu/ld-2.15.so
    7f7512bfd000-7f7512bff000 rw-p 00023000 08:01 2260                       /lib/x86_64-linux-gnu/ld-2.15.so
    7ffcf454c000-7ffcf4585000 rw-p 00000000 00:00 0                          [stack]
    7ffcf459b000-7ffcf459d000 r-xp 00000000 00:00 0                          [vdso]
    ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
    Aborted (core dumped)

With the code below there is no memory problem, but how can I do the group-by and data aggregation with it?

with open("data/petaJoined.csv", "r") as content:
    for line in content:
        # print line
        pass  # do stuff with line
    # no explicit close() needed: the with block closes the file

Does anyone know what is happening?

What I actually want is the result described in the question "pandas read csv out of memory".

Is there a solution?

Note that I already read the CSV in chunks, but the memory error still occurs.

I then changed the chunk size and rewrote bigData.py as follows:

import pandas as pd
import numpy  as np
import sys, traceback, os
import etl2                                    # my self processing flow
reload(etl2)
def iter_chunks(n,df):
    while True:
        try:
           yield df.get_chunk(n)
        except StopIteration:
            break
cksize=5
try:
    dfs = pd.DataFrame()
    reader=pd.read_table( 'data/petaJoined.csv',
                          chunksize   = cksize,
                          low_memory  = False,
                          iterator    = True
                          )                    # choose as appropriate
    for chunk in iter_chunks(cksize,reader):
        temp=etl2.tb_createTopRankTable(chunk)
        dfs.append(temp)
    df=tb_createTopRankTable(dfs)
    #
    # for chunk in reader:
    #     pass
    # temp=tb_createTopRankTable(chunk)
    # dfs.append(temp)
    # df=tb_createTopRankTable(dfs)
except:
    traceback.print_exc(file=sys.stdout)

Still, a segmentation fault occurs after it runs for some time. tb_createTopRankTable is defined as follows:

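def tb_createTopRankTable(df):
    key = 'name1'
    key2 = 'name2'
    # per-group error count and record count
    # (dict-style renaming in agg works on pandas 0.16, the version used here)
    df2 = df.groupby([key, key2])['isError'].agg({'errorNum': 'sum', 'totalParcel': 'count'})
    df2['errorRate'] = df2['errorNum'] / df2['totalParcel']
    return df2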

Solution

Based on your snippet, which reads the file line by line, you can aggregate as you go.

I assume that kb_2 is the error indicator:

groups = {}
with open("data/petaJoined.csv", "r") as large_file:
    for line in large_file:
        arr = line.split('\t')
        # assuming this structure: ka, kb_1, kb_2, timeofEvent, timeInterval
        # (and no header row, since kb_2 is converted with float())
        k = arr[0] + ',' + arr[1]
        if k not in groups:
            groups[k] = {'record_count': 0, 'error_sum': 0}
        groups[k]['record_count'] += 1
        groups[k]['error_sum'] += float(arr[2])
for k, v in groups.items():
    print('{group}: {error_rate}'.format(group=k, error_rate=v['error_sum'] / v['record_count']))

This code snippet stores all the groups in a dictionary, and calculates the error rate after reading the entire file.
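
For comparison, here is a minimal sketch of the same running-totals idea that uses the question's own definition of an error (a record where kb_1 != kb_2) instead of treating kb_2 as a numeric indicator. The function name `aggregate_errors`, the tab delimiter, and the presence of a header row are assumptions made for illustration; the sample rows are the six records shown in the question.

```python
import csv
import io
from collections import defaultdict


def aggregate_errors(lines, delimiter="\t"):
    """Count records and kb_1 != kb_2 mismatches per (ka, kb_1) group."""
    # (ka, kb_1) -> [error_count, record_count]
    totals = defaultdict(lambda: [0, 0])
    reader = csv.reader(lines, delimiter=delimiter)
    next(reader, None)  # skip the header row (assumed to be present)
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or malformed lines
        ka, kb_1, kb_2 = row[0], row[1], row[2]
        counts = totals[(ka, kb_1)]
        counts[0] += int(kb_1 != kb_2)
        counts[1] += 1
    return dict(totals)


# Sample rows taken from the question; a real run would pass an open file.
sample = io.StringIO(
    "ka\tkb_1\tkb_2\ttimeofEvent\ttimeInterval\n"
    "3M\t2345\t2345\t2014-10-5\t3000\n"
    "3M\t2958\t2152\t2015-3-22\t5000\n"
    "GE\t2183\t2183\t2012-12-31\t515\n"
    "3M\t2958\t2958\t2015-3-10\t395\n"
    "GE\t2183\t2285\t2015-4-19\t1925\n"
    "GE\t2598\t2598\t2015-3-17\t1915\n"
)
result = aggregate_errors(sample)
for (ka, kb), (errors, total) in sorted(result.items()):
    print("{0},{1},{2},{3:.0%},{4}".format(ka, kb, errors, errors / total, total))
```

Only one line and the dictionary of totals are held in memory at a time, so memory use depends on the number of distinct groups rather than on the file size.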

The dictionary approach will still run out of memory if there are too many distinct group combinations.
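
If chunked reading works in your environment (the crash in the question happens inside the pandas 0.16 parser, so this sketch assumes a recent pandas release where it does not), the same aggregation can be done per chunk and the partial results combined at the end. The function name `chunked_error_table`, the output column names, and the tab separator are assumptions for illustration, not part of the original post.

```python
import io

import pandas as pd


def chunked_error_table(source, chunksize=100000, sep="\t"):
    """Aggregate error counts per (ka, kb_1) one chunk at a time."""
    partials = []
    reader = pd.read_csv(
        source,
        sep=sep,
        chunksize=chunksize,
        usecols=["ka", "kb_1", "kb_2"],  # load only the columns needed
        dtype=str,                        # compare the keys as strings
    )
    for chunk in reader:
        # An error record is one where kb_1 differs from kb_2.
        chunk["isError"] = (chunk["kb_1"] != chunk["kb_2"]).astype(int)
        partials.append(
            chunk.groupby(["ka", "kb_1"])["isError"].agg(
                errorNum="sum", totalNum="count"
            )
        )
    # Sums and counts are additive, so partial tables can simply be summed.
    combined = pd.concat(partials).groupby(level=["ka", "kb_1"]).sum()
    combined["errorRate"] = combined["errorNum"] / combined["totalNum"]
    return combined


# Sample rows taken from the question; a real run would pass the file path.
sample = io.StringIO(
    "ka\tkb_1\tkb_2\ttimeofEvent\ttimeInterval\n"
    "3M\t2345\t2345\t2014-10-5\t3000\n"
    "3M\t2958\t2152\t2015-3-22\t5000\n"
    "GE\t2183\t2183\t2012-12-31\t515\n"
    "3M\t2958\t2958\t2015-3-10\t395\n"
    "GE\t2183\t2285\t2015-4-19\t1925\n"
    "GE\t2598\t2598\t2015-3-17\t1915\n"
)
table = chunked_error_table(sample, chunksize=2)
print(table)
```

The error rate is computed only after the partial sums and counts are combined, because averaging per-chunk rates would give the wrong answer for groups that span several chunks. The list of partial tables still grows with the number of groups per chunk, so the same limit on distinct groups applies.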
