Reading rows of a big CSV file in Python


Problem description


I have a very large CSV file that I cannot load into memory in full, so I want to read it piece by piece, convert each piece into a NumPy array, and then do some more processing.

I already checked: Lazy Method for Reading Big File in Python?

But the problem is that the approach there uses a normal file reader, and I am unable to find any option for specifying a size in csv.reader.

Also, since I want to convert rows into a NumPy array, I don't want to read any line in half. So rather than specifying a size in bytes, I want something where I can specify the number of rows to read.

Is there any built-in function or easy way to do this?

Solution

csv.reader won't read the whole file into memory. It lazily iterates over the file, line by line, as you iterate over the reader object. So you can use the reader as you normally would, and break out of your loop once you've read however many rows you want. You can see this in the C code used to implement the reader object.

Initializer for the reader object:
static PyObject *
csv_reader(PyObject *module, PyObject *args, PyObject *keyword_args)
{
    PyObject * iterator, * dialect = NULL;
    ReaderObj * self = PyObject_GC_New(ReaderObj, &Reader_Type);

    if (!self)
        return NULL;

    self->dialect = NULL;
    self->fields = NULL;
    self->input_iter = NULL;
    self->field = NULL;
    // stuff we don't care about here
    // ...
    self->input_iter = PyObject_GetIter(iterator);  // here we save the iterator (file object) we passed in
    if (self->input_iter == NULL) {
        PyErr_SetString(PyExc_TypeError,
                        "argument 1 must be an iterator");
        Py_DECREF(self);
        return NULL;
    }

static PyObject *
Reader_iternext(ReaderObj *self)  // This is what gets called when you call `next(reader_obj)` (which is what a for loop does internally)
{
    PyObject *fields = NULL;
    Py_UCS4 c;
    Py_ssize_t pos, linelen;
    unsigned int kind;
    void *data;
    PyObject *lineobj;

    if (parse_reset(self) < 0)
        return NULL;
    do {
        lineobj = PyIter_Next(self->input_iter);  // Equivalent to calling `next(input_iter)`
        if (lineobj == NULL) {
            /* End of input OR exception */
            if (!PyErr_Occurred() && (self->field_len != 0 ||
                                      self->state == IN_QUOTED_FIELD)) {
                if (self->dialect->strict)
                    PyErr_SetString(_csvstate_global->error_obj,
                                    "unexpected end of data");
                else if (parse_save_field(self) >= 0)
                    break;
            }
            return NULL;
        }

As you can see, next(reader_object) calls next(file_object) internally. So you're iterating over both line by line, without reading the entire thing into memory.
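
Because the reader is lazy, reading a fixed number of rows at a time only requires taking rows from it in batches. The following is a minimal sketch of one way to do that with itertools.islice. The function name iter_row_chunks and its parameters are made up for this illustration and are not part of the csv module.

```python
import csv
import io
from itertools import islice


def iter_row_chunks(lines, rows_per_chunk):
    """Yield lists of at most `rows_per_chunk` parsed CSV rows.

    `lines` is any iterable of text lines, for example an open file object.
    (Hypothetical helper written for this example.)
    """
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    reader = csv.reader(lines)
    while True:
        # islice takes at most rows_per_chunk rows from the reader, and the
        # reader in turn pulls only the lines it needs from `lines`.
        chunk = list(islice(reader, rows_per_chunk))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # A small in-memory stand-in for a big file on disk.
    sample = io.StringIO("1,2,3\n4,5,6\n7,8,9\n10,11,12\n13,14,15\n")
    for chunk in iter_row_chunks(sample, 2):
        print(chunk)
```

With a real file, the csv documentation recommends opening it with newline="" before passing it in. Each chunk is a list of rows of strings, so it can then be handed to something like numpy.array(chunk, dtype=float), assuming every field is numeric. Only one chunk is held in memory at a time.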
