Operations on a very large CSV with pandas


Question

I have been using pandas on csv files to get some values out of them. My data looks like this:

"A",23.495,41.995,"this is a sentence with some words"
"B",52.243,0.118,"More text but contains WORD1"
"A",119.142,-58.289,"Also contains WORD1"
"B",423.2535,292.3958,"Doesn't contain anything of interest"
"C",12.413,18.494,"This string contains WORD2"

I have a simple script that reads the CSV and counts, for each group, how many rows contain each WORD, so the output looks like this:

group freqW1 freqW2
A     1      0
B     1      0
C     0      1

I then do some other operations on the values. The problem is that I now have to deal with very large CSV files (20+ GB) that can't be held in memory. I tried the chunksize=x option in pd.read_csv, but the result is a 'TextFileReader' object, which is not subscriptable, so I can't do the necessary operations on the chunks.

I suspect there is some easy way to iterate through the csv and do what I want.

My code looks like this:

import pandas as pd
from collections import Counter

# Reads the whole file into memory at once
df = pd.read_csv("csvfile.txt", sep=",", header=None,
                 names=["group", "val1", "val2", "text"])
freq = Counter(df["group"])  # number of rows per group
# Number of rows per group whose text contains each word
word1 = df[df["text"].str.contains("WORD1")].groupby("group").size()
word2 = df[df["text"].str.contains("WORD2")].groupby("group").size()
df1 = pd.concat([pd.Series(freq), word1, word2], axis=1)

with open("csv_out.txt", "w", encoding="utf-8") as outfile:
    df1.to_csv(outfile, sep=",")

Recommended answer

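You can specify the chunksize option in the read_csv call. The call then returns an iterator (the TextFileReader) that yields one DataFrame of at most chunksize rows at a time, so you loop over it rather than index into it. See the pandas read_csv documentation for details.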
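
As a rough sketch of how the chunked approach could reproduce the per-group counts from the question: the function name word_freq_by_group, the output column names, and the chunk size are illustrative assumptions, and the column names are taken from the question's code. Each chunk is aggregated on its own and the partial counts are summed, so only one chunk is in memory at a time.

```python
import io
import pandas as pd

# Sample rows copied from the question; a real run would pass a file path instead.
SAMPLE = '''"A",23.495,41.995,"this is a sentence with some words"
"B",52.243,0.118,"More text but contains WORD1"
"A",119.142,-58.289,"Also contains WORD1"
"B",423.2535,292.3958,"Doesn't contain anything of interest"
"C",12.413,18.494,"This string contains WORD2"
'''

def word_freq_by_group(source, words, chunksize=100_000):
    """Hypothetical helper: per-group row counts and per-group word hits."""
    names = ["group", "val1", "val2", "text"]
    totals = None
    # With chunksize set, read_csv returns an iterator of DataFrames.
    for chunk in pd.read_csv(source, header=None, names=names,
                             chunksize=chunksize):
        counts = pd.DataFrame({"n": chunk.groupby("group").size()})
        for word in words:
            # Plain substring match; rows with missing text count as no hit.
            hits = chunk["text"].str.contains(word, regex=False, na=False)
            counts["freq_" + word] = hits.groupby(chunk["group"]).sum()
        # Sum partial counts; groups absent from one side are treated as 0.
        totals = counts if totals is None else totals.add(counts, fill_value=0)
    return totals.astype(int).sort_index()

result = word_freq_by_group(io.StringIO(SAMPLE), ["WORD1", "WORD2"], chunksize=2)
print(result)
```

The tiny chunksize=2 is only there to force several chunks on the sample; for a 20+ GB file a much larger value would be the natural choice, tuned to available memory.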

Alternatively, you could use Python's csv module and create your own csv.reader or csv.DictReader, then use it to read the data in whatever chunk size you choose.
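
A minimal sketch of the csv-module route, assuming the four-column layout from the question; iter_chunks and count_words are hypothetical helper names, not part of the csv module. itertools.islice pulls a fixed number of parsed rows from the reader at a time, and plain Counter objects accumulate the totals.

```python
import csv
import io
from collections import Counter
from itertools import islice

# Sample rows copied from the question.
SAMPLE = '''"A",23.495,41.995,"this is a sentence with some words"
"B",52.243,0.118,"More text but contains WORD1"
"A",119.142,-58.289,"Also contains WORD1"
"B",423.2535,292.3958,"Doesn't contain anything of interest"
"C",12.413,18.494,"This string contains WORD2"
'''

def iter_chunks(lines, size):
    """Yield lists of at most `size` parsed rows from an iterable of CSV lines."""
    reader = csv.reader(lines)
    while True:
        rows = list(islice(reader, size))
        if not rows:
            return
        yield rows

def count_words(lines, words, size=100_000):
    """Return (rows per group, {word: rows per group containing that word})."""
    totals = Counter()
    hits = {word: Counter() for word in words}
    for rows in iter_chunks(lines, size):
        for group, _val1, _val2, text in rows:
            totals[group] += 1
            for word in words:
                if word in text:
                    hits[word][group] += 1
    return totals, hits

totals, hits = count_words(io.StringIO(SAMPLE), ["WORD1", "WORD2"], size=2)
print(totals, hits)
```

For a file on disk, the same function can be given a handle opened with open(path, newline="", encoding="utf-8") in place of the StringIO object.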
