Can memmap pandas Series. What about a DataFrame?
Question
It seems that I can memmap the underlying data for a pandas Series by creating a mmap'd ndarray and using it to initialize the Series.
import numpy as np
import pandas as pd

def assert_readonly(iloc):
    try:
        iloc[0] = 999  # Should be non-editable
        raise Exception("MUST BE READ ONLY (1)")
    except ValueError as e:
        assert "read-only" in str(e)  # e.message is Python 2 only

# Original ndarray
n = 1000
_arr = np.arange(0, n, dtype=float)

# Convert it to a memmap
mm = np.memmap(filename, mode='w+', shape=_arr.shape, dtype=_arr.dtype)
mm[:] = _arr[:]
del _arr
mm.flush()
mm.flags['WRITEABLE'] = False  # Make immutable!

# Wrap as a Series
s = pd.Series(mm, name="a")
assert_readonly(s.iloc)
Success! It seems that s is backed by a read-only mem-mapped ndarray.
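(A quick way to double-check this claim, which is my addition and not part of the original post, is np.shares_memory, which reports whether the Series and the memmap use the same buffer; the temp-file setup below is mine.)

```python
import tempfile
import numpy as np
import pandas as pd

# Build a small read-only memmap in a temp file (setup names are my own).
path = tempfile.NamedTemporaryFile(suffix=".dat", delete=False).name
mm = np.memmap(path, mode='w+', shape=(10,), dtype=float)
mm[:] = np.arange(10, dtype=float)
mm.flush()
mm.flags['WRITEABLE'] = False

s = pd.Series(mm, name="a")
# The Series should wrap the same buffer rather than copy it.
print(np.shares_memory(mm, s.to_numpy()))
```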
Can I do the same for a DataFrame? The following fails:
df = pd.DataFrame(s, copy=False, columns=['a'])
assert_readonly(df["a"]) # Fails
The following succeeds, but only for one column:
df = pd.DataFrame(mm.reshape((len(mm), 1)), columns=['a'], copy=False)
assert_readonly(df["a"]) # Succeeds
... so I can make a DF without copying. However, this only works for one column, and I want many. The methods I've found for combining 1-column DFs (pd.concat(..., copy=False), pd.merge(copy=False), ...) all result in copies.
I have some thousands of large columns as datafiles, of which I only ever need a few at a time. I was hoping I'd be able to place their mmap'd representations in a DataFrame as above. Is it possible?
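(For that "one datafile per column" setup, each column can at least be mapped lazily on its own; the sketch below is mine, assuming a hypothetical layout of one raw float64 binary file per column, and maps only the columns requested.)

```python
import os
import tempfile
import numpy as np

# Fake on-disk layout: one raw float64 file per column (hypothetical).
tmpdir = tempfile.mkdtemp()
paths = {}
for name in ("a", "b", "c"):
    p = os.path.join(tmpdir, name + ".bin")
    np.arange(1000, dtype=float).tofile(p)
    paths[name] = p

def load_columns(wanted, n_rows=1000):
    # mode="r" maps each file read-only; nothing is read eagerly.
    return {name: np.memmap(paths[name], mode="r", dtype=float, shape=(n_rows,))
            for name in wanted}

# Map only two of the three columns.
cols = load_columns(["a", "c"])
print(cols["a"][:3], cols["c"][-1])
```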
Pandas documentation makes it a little difficult to guess what's going on under the hood here, although it does say a DataFrame "Can be thought of as a dict-like container for Series objects." I'm beginning to think this is no longer the case.
I'd prefer not to need HDF5 to solve this.
Answer
OK ... after a lot of digging, here's what's going on. Pandas' DataFrame uses the BlockManager class to organize the data internally. Contrary to the docs, a DataFrame is NOT a collection of Series but a collection of similarly-dtyped matrices. BlockManager groups all the float columns together, all the int columns together, etc., and their memory (from what I can tell) is kept together.
It can do that without copying the memory ONLY if a single ndarray matrix (a single dtype) is provided. Note that BlockManager (in theory) also supports not copying mixed-type data in its construction, since it may not be necessary to copy this input into same-typed chunks. However, the DataFrame constructor only skips the copy if a single matrix is the data parameter.
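(This is easy to observe with np.shares_memory; the check is mine, not the answer's, and the dict case is illustrative since exact copy behavior has shifted across pandas versions.)

```python
import numpy as np
import pandas as pd

arr = np.arange(6, dtype=float).reshape(3, 2)

# Single same-dtype matrix: the constructor can keep the original buffer.
df_one = pd.DataFrame(arr, columns=["a", "b"], copy=False)
print(np.shares_memory(arr, df_one.values))

# Dict of 1-D arrays: historically consolidated into new blocks, i.e.
# copied (at least in the pandas versions this answer was written against).
df_many = pd.DataFrame({"a": arr[:, 0], "b": arr[:, 1]}, copy=False)
print(np.shares_memory(arr, df_many["a"].to_numpy()))
```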
In short, if you have mixed types or multiple arrays as input to the constructor, or provide a dict with a single array, you are out of luck in Pandas, and DataFrame's default BlockManager will copy your data.
In any case, one way to work around this is to force BlockManager not to consolidate by type, but to keep each column as a separate 'block'. So, with monkey-patching magic...
from pandas.core.internals import BlockManager

class BlockManagerUnconsolidated(BlockManager):
    def __init__(self, *args, **kwargs):
        BlockManager.__init__(self, *args, **kwargs)
        self._is_consolidated = False
        self._known_consolidated = False

    def _consolidate_inplace(self): pass
    def _consolidate(self): return self.blocks


def df_from_arrays(arrays, columns, index):
    from pandas.core.internals import make_block

    def gen():
        _len = None
        p = 0
        for a in arrays:
            if _len is None:
                _len = len(a)
                assert len(index) == _len
            assert _len == len(a)
            yield make_block(values=a.reshape((1, _len)), placement=(p,))
            p += 1

    blocks = tuple(gen())
    mgr = BlockManagerUnconsolidated(blocks=blocks, axes=[columns, index])
    return pd.DataFrame(mgr, copy=False)
It would be better if DataFrame or BlockManager had a consolidate=False option (or assumed this behavior) when copy=False is specified.
Testing:
import numpy as np
import pandas as pd

def assert_readonly(iloc):
    try:
        iloc[0] = 999  # Should be non-editable
        raise Exception("MUST BE READ ONLY (1)")
    except ValueError as e:
        assert "read-only" in str(e)  # e.message is Python 2 only

# Original ndarray
n = 1000
_arr = np.arange(0, n, dtype=float)

# Convert it to a memmap
mm = np.memmap(filename, mode='w+', shape=_arr.shape, dtype=_arr.dtype)
mm[:] = _arr[:]
del _arr
mm.flush()
mm.flags['WRITEABLE'] = False  # Make immutable!

df = df_from_arrays(
    [mm, mm, mm],
    columns=['a', 'b', 'c'],
    index=range(len(mm)))

assert_readonly(df["a"].iloc)
assert_readonly(df["b"].iloc)
assert_readonly(df["c"].iloc)
It seems a little questionable to me whether there are really practical benefits to BlockManager requiring similarly-typed data to be kept together -- most operations in Pandas are label-row-wise or per-column -- and this follows from a DataFrame being a structure of heterogeneous columns that are usually only associated by their index. Though feasibly they're keeping one index per 'block', gaining a benefit if the index keeps offsets into the block (if that were the case, then they should group by sizeof(dtype), which I don't think they do).
Ho hum...
There was some discussion about a PR to provide a non-copying constructor, which was abandoned.
It looks like there are sensible plans to phase out BlockManager, so your mileage may vary.
Also see Pandas under the hood, which helped me a lot.