How to pre-process data before pandas.read_csv()


Question

I have a slightly broken CSV file that I want to pre-process before reading it with pandas.read_csv(), i.e. run some search/replace over it.

I tried to open the file and do the pre-processing in a generator, which I then hand over to read_csv():

    import re

    import pandas as pd

    def in_stream():
        with open("some.csv") as csvfile:
            for line in csvfile:
                # replace every "," sequence with a plain comma
                yield re.sub(r'","', r',', line)

    df = pd.read_csv(in_stream())

Sadly, this just raises

ValueError: Invalid file path or buffer object type: <class 'generator'>

Yet from looking at pandas' source, I'd expect it to be able to work on iterators, and thus on generators.

The only thing I found was this article (Using a custom object in pandas.read_csv()), which outlines how to wrap a generator in a file-like object, but it seems to work only for files opened in byte mode.

So in the end I'm looking for a pattern to build a pipeline that opens a file, reads it line by line, allows pre-processing, and then feeds the result into e.g. pandas.read_csv().

Recommended answer

After further investigation of pandas' source, it became apparent that it doesn't simply require an iterable: it also wants the object to be a file, which it recognizes by the presence of a read method (is_file_like() in inference.py).
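The check described above can be approximated in a few lines. The sketch below is an illustrative re-implementation, not pandas' actual code: `looks_file_like` and `line_generator` are hypothetical names, and the real helper is exposed as `pandas.api.types.is_file_like`. It shows why a bare generator is rejected: it is iterable, but it has no `read` method.

```python
import io


def looks_file_like(obj):
    """Approximate the duck-typing test: needs read() or write(), plus __iter__."""
    if not (hasattr(obj, "read") or hasattr(obj, "write")):
        return False
    return hasattr(obj, "__iter__")


def line_generator():
    """Stand-in for the in_stream() generator from the question."""
    yield "a,b\n"
    yield "1,2\n"


# A generator is iterable but has neither read() nor write().
generator_ok = looks_file_like(line_generator())

# io.StringIO has read() and supports iteration over lines.
buffer_ok = looks_file_like(io.StringIO("a,b\n1,2\n"))
```

Under this reading, any object that offers both `read` and `__iter__` should get past the check, which is what the class in this answer relies on.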

So I built the generator the old way, as a class that implements the iterator protocol:

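    import re

    import pandas as pd


    class InFile:
        """File-like wrapper that fixes each line before pandas sees it."""

        def __init__(self, infile):
            self.infile = open(infile)

        def __iter__(self):
            return self

        def __next__(self):
            if self.infile.closed:
                raise StopIteration
            line = self.infile.readline()
            if not line:  # readline() returns '' at end of file instead of raising
                self.infile.close()
                raise StopIteration
            return re.sub(r'","', r',', line)  # do some fixing

        def read(self, *args, **kwargs):
            # pandas requires a read() method; return one fixed line per call, '' at end of file
            return next(self, "")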

This works in pandas.read_csv():

    df = pd.read_csv(InFile("some.csv"))

To me this looks overly complicated, and I wonder whether there is a better (i.e. more elegant) solution.
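One simpler pattern, offered here as a sketch rather than as part of the original answer, is to run the fix over every line and hand the result to `read_csv()` as an `io.StringIO` buffer, which is already a text-mode file-like object. The helper names `fix_line`, `fixed_csv_buffer`, and `load_fixed_csv` are hypothetical, and the approach assumes the whole file fits in memory.

```python
import io
import re


def fix_line(line):
    """Apply the same search/replace used in the question."""
    return re.sub(r'","', ",", line)


def fixed_csv_buffer(path):
    """Read the file line by line, fix each line, and return an in-memory text buffer."""
    with open(path) as csvfile:
        return io.StringIO("".join(fix_line(line) for line in csvfile))


def load_fixed_csv(path, **kwargs):
    """Parse the repaired text with pandas; extra keyword arguments go to read_csv()."""
    import pandas as pd  # imported here so the helpers above also work without pandas

    return pd.read_csv(fixed_csv_buffer(path), **kwargs)
```

With these helpers, `df = load_fixed_csv("some.csv")` would take the place of the custom class. For files too large to hold in memory, a streaming wrapper like the `InFile` class remains the alternative.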
