Can memmap pandas Series. What about a DataFrame?


Question

It seems that I can memmap the underlying data for a pandas Series by creating a mmap'd ndarray and using it to initialize the Series:

        import numpy as np
        import pandas as pd

        def assert_readonly(iloc):
            try:
                iloc[0] = 999  # Should be non-editable
                raise Exception("MUST BE READ ONLY (1)")
            except ValueError as e:
                assert "read-only" in str(e)  # e.message is Python 2 only

        filename = "col.dat"  # any writable path for the backing file

        # Original ndarray
        n = 1000
        _arr = np.arange(0, n, dtype=float)

        # Convert it to a memmap
        mm = np.memmap(filename, mode='w+', shape=_arr.shape, dtype=_arr.dtype)
        mm[:] = _arr[:]
        del _arr
        mm.flush()
        mm.flags['WRITEABLE'] = False  # Make immutable!

        # Wrap as a series
        s = pd.Series(mm, name="a")
        assert_readonly(s.iloc)

Success! It seems that s is backed by a read-only mem-mapped ndarray. Can I do the same for a DataFrame? The following fails:

        df = pd.DataFrame(s, copy=False, columns=['a'])
        assert_readonly(df["a"]) # Fails

The following succeeds, but only for one column:

        df = pd.DataFrame(mm.reshape((len(mm), 1)), columns=['a'], copy=False)
        assert_readonly(df["a"]) # Succeeds

... so I can make a DF without copying. However, this only works for one column, and I want many. The methods I've found for combining one-column DFs -- pd.concat(.., copy=False), pd.merge(copy=False), ... -- all result in copies.

I have some thousands of large columns stored as data files, of which I only ever need a few at a time. I was hoping I'd be able to place their mmap'd representations in a DataFrame as above. Is it possible?
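For reference, opening just a handful of columns as read-only memmaps is cheap by itself; the open question is only the DataFrame wrapper. A minimal sketch, with hypothetical per-column raw float64 files:

```python
import os
import tempfile
import numpy as np

# Hypothetical layout: each column lives in its own raw float64 file
tmp = tempfile.mkdtemp()
for name in ("a", "b", "c"):
    np.arange(5, dtype=float).tofile(os.path.join(tmp, name + ".dat"))

# Open only the columns needed, read-only, without loading them into RAM
cols = {name: np.memmap(os.path.join(tmp, name + ".dat"), dtype=float, mode="r")
        for name in ("a", "b")}
print(list(cols["a"][:3]))  # [0.0, 1.0, 2.0]
```

With mode="r" the memmaps are non-writeable from the start, so no WRITEABLE flag fiddling is needed.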

The Pandas documentation makes it a little difficult to guess what's going on under the hood here, although it does say a DataFrame "Can be thought of as a dict-like container for Series objects." I'm beginning to think this is no longer the case.

I'd prefer not to need HDF5 to solve this.

Answer

OK ... after a lot of digging, here's what's going on. Pandas' DataFrame uses the BlockManager class to organize the data internally. Contrary to the docs, a DataFrame is NOT a collection of Series but a collection of similarly-dtyped matrices. The BlockManager groups all the float columns together, all the int columns together, etc., and their memory (from what I can tell) is kept together.
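This consolidation is easy to observe through the block manager itself; a quick sketch, assuming pandas >= 1.0, where the manager is exposed as the private `_mgr` attribute (internals, so version-dependent):

```python
import numpy as np
import pandas as pd

# Two float columns and one int column
df = pd.DataFrame({"f1": np.zeros(3), "f2": np.ones(3), "i1": np.arange(3)})

# Private API: the two float columns share a single 2-D block,
# the int column gets its own block
print([str(b.dtype) for b in df._mgr.blocks])
```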

It can do that without copying the memory ONLY if a single ndarray matrix (a single dtype) is provided. Note that BlockManager (in theory) also supports non-copying construction from mixed-type data, since it may not be necessary to copy that input into same-typed chunks. In practice, though, the DataFrame constructor avoids the copy only when a single matrix is passed as the data parameter.
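The single-matrix no-copy path can be checked with `np.shares_memory`; a sketch (on recent pandas with copy-on-write, `copy=False` must be passed explicitly to get the view):

```python
import numpy as np
import pandas as pd

arr = np.arange(12, dtype=float).reshape(4, 3)  # one homogeneous 2-D matrix

# A single ndarray with copy=False: the lone float block wraps it directly
df = pd.DataFrame(arr, columns=list("abc"), copy=False)
print(np.shares_memory(arr, df.values))
```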

In short, if you have mixed types or multiple arrays as input to the constructor, or provide a dict with even a single array, you are out of luck in Pandas, and the DataFrame's default BlockManager will copy your data.

In any case, one way to work around this is to force BlockManager to not consolidate-by-type, but to keep each column as a separate 'block'. So, with monkey-patching magic...

import pandas as pd
from pandas.core.internals import BlockManager

# NOTE: relies on pandas internals; make_block/BlockManager signatures
# have changed across versions, so this targets the pandas of the time.
class BlockManagerUnconsolidated(BlockManager):
    """A BlockManager that never consolidates same-dtype blocks."""
    def __init__(self, *args, **kwargs):
        BlockManager.__init__(self, *args, **kwargs)
        self._is_consolidated = False
        self._known_consolidated = False

    def _consolidate_inplace(self): pass
    def _consolidate(self): return self.blocks


def df_from_arrays(arrays, columns, index):
    from pandas.core.internals import make_block
    def gen():
        _len = None
        for p, a in enumerate(arrays):
            if _len is None:
                _len = len(a)
                assert len(index) == _len
            assert _len == len(a)
            # one 1xN block per column, placed at column position p
            yield make_block(values=a.reshape((1, _len)), placement=(p,))

    blocks = tuple(gen())
    mgr = BlockManagerUnconsolidated(blocks=blocks, axes=[columns, index])
    return pd.DataFrame(mgr, copy=False)

It would be better if DataFrame or BlockManager offered a consolidate=False option (or assumed this behavior) when copy=False is specified.

Testing:

import numpy as np
import pandas as pd

def assert_readonly(iloc):
    try:
        iloc[0] = 999  # Should be non-editable
        raise Exception("MUST BE READ ONLY (1)")
    except ValueError as e:
        assert "read-only" in str(e)  # e.message is Python 2 only

filename = "col.dat"  # any writable path for the backing file

# Original ndarray
n = 1000
_arr = np.arange(0, n, dtype=float)

# Convert it to a memmap
mm = np.memmap(filename, mode='w+', shape=_arr.shape, dtype=_arr.dtype)
mm[:] = _arr[:]
del _arr
mm.flush()
mm.flags['WRITEABLE'] = False  # Make immutable!

df = df_from_arrays(
    [mm, mm, mm],
    columns=['a', 'b', 'c'],
    index=range(len(mm)))
assert_readonly(df["a"].iloc)
assert_readonly(df["b"].iloc)
assert_readonly(df["c"].iloc)
It seems a little questionable to me whether there's really a practical benefit to BlockManager requiring similarly typed data to be kept together -- most operations in Pandas are label-wise by row, or per column -- which follows from a DataFrame being a structure of heterogeneous columns that are usually only associated by their index. Though feasibly they keep one index per 'block', gaining a benefit if the index stores offsets into the block (if that were the case, then they should group by sizeof(dtype), which I don't think they do). Ho hum...

There was some discussion about a PR to provide a non-copying constructor, which was abandoned.

It looks like there are plans to phase out BlockManager, so your mileage may vary.

Also see Pandas under the hood, which helped me a lot.
