Pandas read csv out of memory


Question

I tried to manipulate a large CSV file using Pandas. When I wrote this

df = pd.read_csv(strFileName,sep='\t',delimiter='\t')

it raises:

pandas.parser.CParserError: Error tokenizing data. C error: out of memory

wc -l indicates there are 13,822,117 lines. I need to aggregate over this CSV file as a data frame. Is there a way to handle this other than splitting the CSV into several files and writing code to merge the results? Any suggestions on how to do that? Thanks.

The input looks like this:

columns=[ka,kb_1,kb_2,timeofEvent,timeInterval]
0:'3M' '2345' '2345' '2014-10-5',3000
1:'3M' '2958' '2152' '2015-3-22',5000
2:'GE' '2183' '2183' '2012-12-31',515
3:'3M' '2958' '2958' '2015-3-10',395
4:'GE' '2183' '2285' '2015-4-19',1925
5:'GE' '2598' '2598' '2015-3-17',1915

The desired output is:

columns=[ka,kb,errorNum,errorRate,totalNum of records]
'3M','2345',0,0%,1
'3M','2958',1,50%,2
'GE','2183',1,50%,2
'GE','2598',0,0%,1

If the data set were small, the code below (provided by another answer) could be used:

df2 = df.groupby(['ka','kb_1'])['isError'].agg({'errorNum':  'sum',
                                                'recordNum': 'count'})

df2['errorRate'] = df2['errorNum'] / df2['recordNum']

ka kb_1  recordNum  errorNum  errorRate

3M 2345          1         0        0.0
   2958          2         1        0.5
GE 2183          2         1        0.5
   2598          1         0        0.0
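If splitting the file is to be avoided, the same aggregation can be done in chunks with pandas itself. This is a minimal sketch (not the asker's code), assuming the tab-separated layout shown above; per-chunk sums and counts are combined at the end, which is safe because both are additive:

```python
import io
import pandas as pd

# Inline stand-in for the large tab-separated file described in the question.
csv_data = io.StringIO(
    "ka\tkb_1\tkb_2\ttimeofEvent\ttimeInterval\n"
    "3M\t2345\t2345\t2014-10-5\t3000\n"
    "3M\t2958\t2152\t2015-3-22\t5000\n"
    "GE\t2183\t2183\t2012-12-31\t515\n"
    "3M\t2958\t2958\t2015-3-10\t395\n"
    "GE\t2183\t2285\t2015-4-19\t1925\n"
    "GE\t2598\t2598\t2015-3-17\t1915\n"
)

partials = []
# chunksize keeps only a slice of the file in memory at a time;
# a real file would use a much larger value than 2.
for chunk in pd.read_csv(csv_data, sep="\t", chunksize=2):
    # error definition from the question: kb_1 != kb_2
    chunk["isError"] = (chunk["kb_1"] != chunk["kb_2"]).astype(int)
    partials.append(
        chunk.groupby(["ka", "kb_1"])["isError"].agg(["sum", "count"])
    )

# Partial sums and counts are themselves summable, so re-group and sum.
df2 = pd.concat(partials).groupby(level=["ka", "kb_1"]).sum()
df2.columns = ["errorNum", "recordNum"]
df2["errorRate"] = df2["errorNum"] / df2["recordNum"]
print(df2)
```

On the sample rows this reproduces the desired output above: 3M/2958 and GE/2183 each get 1 error out of 2 records (50%), the other two groups 0%.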

(Definition of an error record: when kb_1 != kb_2, the corresponding record is treated as abnormal.)

Answer

Based on your snippet in "out of memory error when reading csv file in chunk", here is a line-by-line approach.

Assuming kb_2 is the error indicator:

groups = {}
with open("data/petaJoined.csv", "r") as large_file:
    for line in large_file:
        arr = line.split('\t')
        # assuming this structure: ka,kb_1,kb_2,timeofEvent,timeInterval
        k = arr[0] + ',' + arr[1]
        if k not in groups:  # the original was missing the trailing colon here
            groups[k] = {'record_count': 0, 'error_sum': 0}
        groups[k]['record_count'] = groups[k]['record_count'] + 1
        groups[k]['error_sum'] = groups[k]['error_sum'] + float(arr[2])
for k, v in groups.items():  # items is a method and must be called
    print('{group}: {error_rate}'.format(group=k, error_rate=v['error_sum'] / v['record_count']))

This code snippet stores all the groups in a dictionary and calculates the error rate after reading the entire file.

It will still encounter an out-of-memory exception if there are too many combinations of groups.
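Note that the answer's loop sums kb_2 as if it were a 0/1 flag, while the question defines an error as kb_1 != kb_2. Under that definition, the per-line update could compare the two keys directly; a sketch, with inline data standing in for the file:

```python
import io

# Inline stand-in for data/petaJoined.csv (ka,kb_1,kb_2,timeofEvent,timeInterval).
large_file = io.StringIO(
    "3M\t2345\t2345\t2014-10-5\t3000\n"
    "3M\t2958\t2152\t2015-3-22\t5000\n"
    "GE\t2183\t2183\t2012-12-31\t515\n"
)

groups = {}
for line in large_file:
    ka, kb_1, kb_2, _timeofEvent, _timeInterval = line.rstrip("\n").split("\t")
    key = (ka, kb_1)
    stats = groups.setdefault(key, {"record_count": 0, "error_sum": 0})
    stats["record_count"] += 1
    stats["error_sum"] += int(kb_1 != kb_2)  # error when kb_1 != kb_2

for key, stats in groups.items():
    print(key, stats["error_sum"] / stats["record_count"])
```

This keeps the same one-dictionary memory profile as the answer's loop, so the same caveat about too many group combinations applies.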

