Can memmap pandas series. What about a dataframe?

Question

It seems that I can memmap the underlying data for a pandas Series by creating an mmap'd ndarray and using it to initialize the Series.

        import numpy as np
        import pandas as pd

        def assert_readonly(iloc):
            try:
                iloc[0] = 999  # Should be non-editable
                raise Exception("MUST BE READ ONLY (1)")
            except ValueError as e:
                assert "read-only" in str(e)  # e.message is Python 2 only

        filename = "data.mm"  # any writable path

        # Original ndarray
        n = 1000
        _arr = np.arange(0, n, dtype=float)

        # Convert it to a memmap
        mm = np.memmap(filename, mode='w+', shape=_arr.shape, dtype=_arr.dtype)
        mm[:] = _arr[:]
        del _arr
        mm.flush()
        mm.flags['WRITEABLE'] = False  # Make immutable!

        # Wrap as a series
        s = pd.Series(mm, name="a")
        assert_readonly(s.iloc)

Success! It seems that s is backed by a read-only mem-mapped ndarray. Can I do the same for a DataFrame? The following fails:

        df = pd.DataFrame(s, copy=False, columns=['a'])
        assert_readonly(df["a"]) # Fails

The following succeeds, but only for one column:

        df = pd.DataFrame(mm.reshape((len(mm), 1)), columns=['a'], copy=False)
        assert_readonly(df["a"]) # Succeeds

... so I can make a DF without copying. However, this only works for one column, and I want many. The methods I've found for combining 1-column DFs, pd.concat(..., copy=False), pd.merge(..., copy=False), etc., all result in copies.
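A minimal sketch of the problem (assumed behavior; in newer pandas the `copy` keyword on `concat` is deprecated or removed, and passing it where it exists does not prevent the consolidation copy anyway):

```python
import numpy as np
import pandas as pd

a = np.arange(4, dtype=float)
b = np.arange(4, dtype=float)

# Single-column frames built without copying the arrays
df1 = pd.DataFrame(a.reshape((len(a), 1)), columns=['a'], copy=False)
df2 = pd.DataFrame(b.reshape((len(b), 1)), columns=['b'], copy=False)

# Combining them consolidates the same-dtype blocks, which re-allocates
# memory in classic pandas; check whether the original buffer survives
out = pd.concat([df1, df2], axis=1)
print(np.shares_memory(out['a'].to_numpy(), a))
```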

I have some thousands of large columns as datafiles, of which I only ever need a few at a time. I was hoping I'd be able to place their mmap'd representations in a DataFrame as above. Is it possible?

Pandas documentation makes it a little difficult to guess what's going on under the hood here, although it does say a DataFrame "Can be thought of as a dict-like container for Series objects." I'm beginning to think this is no longer the case.

I'd prefer not to need HDF5 to solve this.

Answer

OK ... after a lot of digging, here's what's going on. Pandas' DataFrame uses the BlockManager class to organize the data internally. Contrary to the docs, a DataFrame is NOT a collection of Series but a collection of similarly-dtyped matrices. BlockManager groups all the float columns together, all the int columns together, etc., and their memory (from what I can tell) is kept together.
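You can see this grouping by peeking at the block manager directly (a private, version-dependent API: the attribute is `._mgr` in recent pandas, `._data` in older releases):

```python
import numpy as np
import pandas as pd

# Three columns, two dtypes: the two float columns can share one block
df = pd.DataFrame({'a': np.arange(3, dtype=float),
                   'b': np.arange(3, dtype=float),
                   'c': np.arange(3, dtype=np.int64)})

mgr = df._mgr if hasattr(df, '_mgr') else df._data  # private API!
for blk in mgr.blocks:
    print(blk.dtype, blk.shape)
```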

It can do that without copying the memory ONLY if a single ndarray matrix (a single type) is provided. Note that BlockManager (in theory) also supports not copying mixed-type data in its construction, since it may not be necessary to copy this input into same-typed chunks. However, the DataFrame constructor only avoids a copy when a single matrix is passed as the data parameter.
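A sketch of the single-matrix case (assumed behavior; whether the constructor takes a view can vary with pandas version and copy-on-write mode, so the sharing check is printed rather than relied upon):

```python
import numpy as np
import pandas as pd

mat = np.arange(12, dtype=float).reshape((4, 3))

# One 2-D ndarray of a single dtype: the constructor can wrap it as one block
df = pd.DataFrame(mat, columns=['a', 'b', 'c'], copy=False)

# Check whether the frame still shares the original buffer
print(np.shares_memory(df.to_numpy(), mat))
```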

In short, if you have mixed types or multiple arrays as input to the constructor, or provide a dict with a single array, you are out of luck in Pandas, and the DataFrame's default BlockManager will copy your data.

In any case, one way to work around this is to force BlockManager to not consolidate-by-type, but to keep each column as a separate 'block'. So, with monkey-patching magic...

        from pandas.core.internals import BlockManager
        class BlockManagerUnconsolidated(BlockManager):
            def __init__(self, *args, **kwargs):
                BlockManager.__init__(self, *args, **kwargs)
                self._is_consolidated = False
                self._known_consolidated = False

            def _consolidate_inplace(self): pass
            def _consolidate(self): return self.blocks


        def df_from_arrays(arrays, columns, index):
            from pandas.core.internals import make_block  # private API; location varies by pandas version

            def gen():
                _len = None
                p = 0
                for a in arrays:
                    if _len is None:
                        _len = len(a)
                        assert len(index) == _len
                    assert _len == len(a)
                    yield make_block(values=a.reshape((1, _len)), placement=(p,))
                    p += 1

            blocks = tuple(gen())
            mgr = BlockManagerUnconsolidated(blocks=blocks, axes=[columns, index])
            return pd.DataFrame(mgr, copy=False)

It would be better if DataFrame or BlockManager had a consolidate=False option (or assumed this behavior) when copy=False is specified.

To test:

    def assert_readonly(iloc):
        try:
            iloc[0] = 999  # Should be non-editable
            raise Exception("MUST BE READ ONLY (1)")
        except ValueError as e:
            assert "read-only" in str(e)  # e.message is Python 2 only

    filename = "data.mm"  # any writable path

    # Original ndarray
    n = 1000
    _arr = np.arange(0, n, dtype=float)

    # Convert it to a memmap
    mm = np.memmap(filename, mode='w+', shape=_arr.shape, dtype=_arr.dtype)
    mm[:] = _arr[:]
    del _arr
    mm.flush()
    mm.flags['WRITEABLE'] = False  # Make immutable!

    df = df_from_arrays(
        [mm, mm, mm],
        columns=['a', 'b', 'c'],
        index=range(len(mm)))
    assert_readonly(df["a"].iloc)
    assert_readonly(df["b"].iloc)
    assert_readonly(df["c"].iloc)

It seems a little questionable to me whether there are really practical benefits to BlockManager requiring similarly typed data to be kept together: most of the operations in Pandas are label-row-wise or per-column, which follows from a DataFrame being a structure of heterogeneous columns that are usually only associated by their index. Feasibly they keep one index per 'block', gaining benefit if the index keeps offsets into the block (if that were the case, then they should group by sizeof(dtype), which I don't think is the case). Ho hum...

There was some discussion about a PR to provide a non-copying constructor, which was abandoned.

It looks like there are sensible plans to phase out BlockManager, so your mileage may vary.

Also see Pandas under the hood, which helped me a lot.
