Sequentially read huge CSV file in Python

Question

I have a 10gb CSV file that contains some information that I need to use.

As I have limited memory on my PC, I cannot read the whole file into memory in a single batch. Instead, I would like to read only some rows of the file at each iteration.

Say that in the first iteration I want to read the first 100 rows, in the second rows 101 to 200, and so on.

Is there an efficient way to perform this task in Python? Does pandas provide something useful for this? Or are there better methods (in terms of memory and speed)?

Recommended answer

Here is the short answer.

import pandas as pd

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
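
Here, filename is the path to your CSV file and process(chunk) stands for whatever work you need to do on each chunk. As a minimal sketch (assuming a hypothetical column named 'value' and that the filtered result fits in memory), you could filter each chunk and concatenate the pieces at the end:

import pandas as pd

chunksize = 10 ** 6
filtered_pieces = []
for chunk in pd.read_csv(filename, chunksize=chunksize):
    # 'value' is a hypothetical column, used only for illustration
    filtered_pieces.append(chunk[chunk['value'] > 0])
result = pd.concat(filtered_pieces)

Each chunk is an ordinary DataFrame, so the usual pandas operations apply, and only chunksize rows are held in memory at any time. If you literally want the first 100 rows, then rows 101 to 200, and so on, as in the question, you can simply set chunksize=100.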

Here is the long answer.

To get started, you’ll need to import pandas and sqlalchemy. The commands below will do that.

import pandas as pd
from sqlalchemy import create_engine

Next, set up a variable that points to your csv file. This isn’t necessary but it does help in re-usability.

file = '/path/to/csv/file'

With these three lines of code, we are ready to start analyzing our data. Let’s take a look at the ‘head’ of the csv file to see what the contents might look like.

print(pd.read_csv(file, nrows=5))

This command uses pandas' read_csv function to read in only 5 rows (nrows=5) and then prints those rows to the screen. This lets you understand the structure of the csv file and make sure the data is formatted in a way that makes sense for your work.

Before we can actually work with the data, we need to do something with it so that we can begin to filter it and work with subsets of the data. This is usually what I would use a pandas DataFrame for, but with large data files we need to store the data somewhere else. In this case, we'll set up a local SQLite database, read the csv file in chunks and then write those chunks to SQLite.

To do this, we'll first need to create the SQLite database using the following command.

csv_database = create_engine('sqlite:///csv_database.db')

Next, we need to iterate through the CSV file in chunks and store the data into SQLite.

chunksize = 100000
i = 0  # number of chunks written so far
j = 1  # index offset for the current chunk
for df in pd.read_csv(file, chunksize=chunksize, iterator=True):
    # strip spaces out of the column names
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    df.index += j
    i += 1
    df.to_sql('table', csv_database, if_exists='append')
    j = df.index[-1] + 1

With this code, we are setting the chunksize at 100,000 to keep the size of the chunks manageable, initializing a couple of counters (i=0, j=1) and then running a for loop. The for loop reads a chunk of data from the CSV file, removes spaces from the column names, then stores the chunk in the SQLite database (df.to_sql(…)).

This might take a while if your CSV file is sufficiently large, but the time spent waiting is worth it because you can now use pandas' SQL tools to pull data from the database without worrying about memory constraints.

To access the data now, you can run commands like the following:

df = pd.read_sql_query('SELECT * FROM "table"', csv_database)  # "table" is quoted because TABLE is a reserved word in SQLite

Of course, using 'SELECT * …' will load all the data into memory, which is the problem we are trying to get away from, so you should put filters into your SELECT statements to restrict the data. For example:

df = pd.read_sql_query('SELECT COL1, COL2 FROM "table" WHERE COL1 = SOMEVALUE', csv_database)
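
If even the filtered result might be too large, note that read_sql_query also accepts a chunksize argument, so the query result itself can be streamed in pieces instead of being loaded at once. A minimal sketch, reusing the placeholders from above (COL1, SOMEVALUE and process):

for chunk in pd.read_sql_query('SELECT COL1, COL2 FROM "table" WHERE COL1 = SOMEVALUE',
                               csv_database, chunksize=10000):
    # each chunk is an ordinary DataFrame with at most 10,000 rows
    process(chunk)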
