Using a custom object in pandas.read_csv()

Question

I am interested in streaming a custom object into a pandas dataframe. According to the documentation, any object with a read() method can be used. However, even after implementing this function I am still getting this error:

ValueError: Invalid file path or buffer object type: <class '__main__.DataFile'>

Here is a simple version of the object, and how I am calling it:

class DataFile(object):
    def __init__(self, files):
        self.files = files

    def read(self):
        for file_name in self.files:
            with open(file_name, 'r') as file:
                for line in file:
                    yield line

import pandas as pd
hours = ['file1.csv', 'file2.csv', 'file3.csv']

data = DataFile(hours)
df = pd.read_csv(data)

Am I missing something, or is it just not possible to use a custom generator in Pandas? When I call the read() method it works just fine.

The reason I want to use a custom object rather than concatenating the dataframes together is to see if it is possible to reduce memory usage. I have used the gensim library in the past, and it makes it really easy to use custom data objects, so I was hoping to find some similar approach.

Recommended answer

One way to make a file-like object in Python 3 is to subclass io.RawIOBase. Using Mechanical snail's iterstream, you can convert any iterable of bytes into a file-like object:

import tempfile
import io
import pandas as pd

def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).

    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None
        def readable(self):
            return True
        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0    # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)


class DataFile(object):
    def __init__(self, files):
        self.files = files

    def read(self):
        for file_name in self.files:
            with open(file_name, 'rb') as f:
                for line in f:
                    yield line

def make_files(num):
    filenames = []
    for i in range(num):
        with tempfile.NamedTemporaryFile(mode='wb', delete=False) as f:
            f.write(b'''1,2,3\n4,5,6\n''')
            filenames.append(f.name)
    return filenames

# hours = ['file1.csv', 'file2.csv', 'file3.csv']
hours = make_files(3)
print(hours)
data = DataFile(hours)
df = pd.read_csv(iterstream(data.read()), header=None)

print(df)

This prints:

   0  1  2
0  1  2  3
1  4  5  6
2  1  2  3
3  4  5  6
4  1  2  3
5  4  5  6
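
As a side note on why the original DataFile is not usable as a buffer on its own: its read() is a generator function, so calling it returns a generator rather than bytes or str, and it accepts no size argument. A parser that expects a file-like object typically calls read(size) and expects a chunk of data back. The following is a minimal standard-library sketch of that mismatch, not a description of how pandas validates its input (that may differ between versions). LineSource is a hypothetical stand-in for DataFile that uses in-memory lines instead of files.

```python
import io
import types


class LineSource:
    """Hypothetical stand-in for the question's DataFile, using in-memory lines."""

    def __init__(self, lines):
        self.lines = lines

    def read(self):
        # A generator function: calling it returns a generator, not data.
        for line in self.lines:
            yield line


source = LineSource([b"1,2,3\n", b"4,5,6\n"])

# Calling read() produces a generator object, not bytes.
generator_result = source.read()
is_generator = isinstance(generator_result, types.GeneratorType)

# read() takes no size argument, unlike a file object's read(size).
try:
    source.read(4)
    accepts_size = True
except TypeError:
    accepts_size = False

# A real binary stream returns at most `size` bytes per call.
stream = io.BytesIO(b"1,2,3\n4,5,6\n")
first_chunk = stream.read(4)

print(is_generator, accepts_size, first_chunk)
```

This is the gap that iterstream closes: it wraps the generator in an io.RawIOBase subclass whose readinto() hands out the yielded bytes in the sizes the caller asks for.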
