How to keep major-order when copying or groupby-ing a pandas DataFrame?

Problem description

How can I use or manipulate (monkey-patch) pandas so that copies and groupby aggregations always keep the same major-order on the resulting object?

I use pandas.DataFrame as a data structure within a business application (risk model) and need fast aggregation of multidimensional data. Aggregation with pandas depends crucially on the major-ordering scheme in use on the underlying numpy array.

Unfortunately, pandas (version 0.23.4) changes the major-order of the underlying numpy array when I create a copy or when I perform an aggregation with groupby and sum.

The impact is:

Case 1: 17.2 seconds

Case 2: 5 minutes 46 seconds

on a DataFrame and its copy with 45023 rows and 100000 columns. The aggregation was performed on the index, a pd.MultiIndex with 15 levels. It keeps three levels and leads to about 239 groups.
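The effect of the memory layout on a row-wise group sum can be explored with a scaled-down NumPy experiment. This is only a sketch under assumptions: the array size, the `rows` selection, and the helper `group_sum` are hypothetical stand-ins, and the timings depend on the machine, so it does not reproduce the figures reported above.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
c_arr = rng.random((2000, 500))    # row-major: each row is contiguous in memory
f_arr = np.asfortranarray(c_arr)   # column-major copy holding the same values
rows = np.arange(0, 2000, 7)       # rows of one hypothetical group


def group_sum(values):
    """Sum the selected rows, giving one result per column."""
    return values[rows].sum(axis=0)


# Time the same aggregation on both layouts; only the layout differs.
for name, values in (("C order", c_arr), ("F order", f_arr)):
    seconds = timeit.timeit(lambda: group_sum(values), number=20)
    print(f"{name}: {seconds:.4f} s for 20 runs")
```

Both calls return the same sums; only the time spent gathering the rows differs between the two layouts.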

I typically work on DataFrames with 45000 rows and 100000 columns. On the rows I have a pandas.MultiIndex with about 15 levels. To compute statistics on various hierarchy nodes I need to aggregate (sum) on the index dimension.
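As a minimal sketch of this kind of aggregation, assuming a tiny frame with a three-level row index (the level names `region`, `desk`, and `book` are hypothetical and not taken from the risk model), summing over a subset of index levels looks like this:

```python
import numpy as np
import pandas as pd

# Hypothetical three-level row index; the real model uses about 15 levels.
index = pd.MultiIndex.from_tuples(
    [("EU", "rates", "b1"), ("EU", "rates", "b2"),
     ("EU", "fx", "b3"), ("US", "fx", "b4")],
    names=["region", "desk", "book"],
)
frame = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3), index=index)

# Keep two of the three levels: rows sharing (region, desk) are summed.
by_desk = frame.groupby(level=["region", "desk"]).sum()

# Keep a single level: rows sharing the region are summed.
by_region = frame.groupby(level="region").sum()

print(by_desk)
print(by_region)
```

Each hierarchy node corresponds to one choice of levels to keep, so the same frame is typically aggregated many times with different `level` arguments.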

Aggregation is fast if the underlying numpy array is c_contiguous, hence held in row-major order (C order). It is very slow if it is f_contiguous, hence in column-major order (F order).
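The two layouts can be inspected directly in NumPy. The following sketch uses a small int64 array (the variable names are illustrative) to show the contiguity flags, the strides, and the fact that a transpose only swaps strides:

```python
import numpy as np

c_arr = np.arange(6, dtype=np.int64).reshape(2, 3)  # NumPy default: C order (row-major)
f_arr = np.asfortranarray(c_arr)                    # same values in F order (column-major)

print(c_arr.flags.c_contiguous, c_arr.flags.f_contiguous)
print(f_arr.flags.c_contiguous, f_arr.flags.f_contiguous)

# Strides are the byte steps taken when moving one index along each axis.
print(c_arr.strides)  # (24, 8): the next element of a row is adjacent
print(f_arr.strides)  # (8, 16): the next element of a column is adjacent

# A transpose is a view with swapped strides: no data is moved,
# and a C-contiguous array becomes an F-contiguous view.
t = c_arr.T
print(t.flags.f_contiguous, np.shares_memory(t, c_arr))
```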

Unfortunately, pandas changes the major-order from C to F when

  • creating a copy of a DataFrame, or even when
  • performing an aggregation via a groupby and taking the sum on the grouper. Hence the resulting DataFrame has a different major-order (!)

Sure, I could stick to another 'datamodel', just by keeping the MultiIndex on the columns. Then the current pandas version would always work in my favor. But this is a no-go. I think one can expect that the two operations under consideration (groupby-sum and copy) do not change the major-order.

import numpy as np
import pandas as pd

print("pandas version: ", pd.__version__)

array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Numpy array is C-contiguous: ", array.flags.c_contiguous)

dataframe = pd.DataFrame(array, index = pd.MultiIndex.from_tuples([('A', 'U'), ('A', 'V'), ('B', 'W')], names=['dim_one', 'dim_two']))
print("DataFrame is C-contiguous: ", dataframe.values.flags.c_contiguous)

dataframe_copy = dataframe.copy()
print("Copy of DataFrame is C-contiguous: ", dataframe_copy.values.flags.c_contiguous)

aggregated_dataframe = dataframe.groupby('dim_one').sum()
print("Aggregated DataFrame is C-contiguous: ", aggregated_dataframe.values.flags.c_contiguous)


## Output in Jupyter Notebook
# pandas version:  0.23.4
# Numpy array is C-contiguous:  True
# DataFrame is C-contiguous:  True
# Copy of DataFrame is C-contiguous:  False
# Aggregated DataFrame is C-contiguous:  False

The major order of the data should be preserved. If pandas prefers to switch to an implicit preference, then it should allow overriding it. Numpy allows specifying the order when creating a copy.
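For reference, this is how NumPy exposes the order on copies. The sketch below starts from an F-ordered array and compares the defaults of `ndarray.copy` and `np.copy` with an explicit `order` argument:

```python
import numpy as np

f_arr = np.asfortranarray(np.arange(6).reshape(2, 3))  # F-contiguous source

default_copy = f_arr.copy()        # ndarray.copy defaults to order='C'
kept_copy = f_arr.copy(order="K")  # 'K' keeps the layout of the source
func_copy = np.copy(f_arr)         # np.copy defaults to order='K'
forced_c = f_arr.copy(order="C")   # explicit request for C order

for name, arr in [("default", default_copy), ("keep", kept_copy),
                  ("np.copy", func_copy), ("forced C", forced_c)]:
    print(name, arr.flags.c_contiguous, arr.flags.f_contiguous)
```

All four copies hold the same values; only their memory layout differs.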

A patched version of pandas should result in

## Output in Jupyter Notebook
# pandas version:  0.23.4
# Numpy array is C-contiguous:  True
# DataFrame is C-contiguous:  True
# Copy of DataFrame is C-contiguous:  True
# Aggregated DataFrame is C-contiguous:  True

for the example code snippet above.

Recommended answer

Monkey Patch for Pandas (0.23.4 and maybe other versions too)

I created a patch which I would like to share with you. It results in the performance increase mentioned in the question above.

It works for pandas version 0.23.4. For other versions you need to check whether it still works.

The following two modules are needed; you might adapt the imports depending on where you put them.

memory_layout.py   
memory.py

To patch your code, simply import the following at the very beginning of your program or notebook and set the memory layout parameter. It will monkey patch pandas and make sure that copies of DataFrames have the requested layout.

from memory_layout import memory_layout
# memory_layout.order = 'F'  # assert F-order on copy
# memory_layout.order = 'K'  # Keep given layout on copy 
memory_layout.order = 'C'  # assert C-order on copy

memory_layout.py

Create file memory_layout.py with the following content.

import numpy as np
from pandas.core.internals import Block
from memory import memory_layout

# memory_layout.order = 'F'  # set memory layout order to 'F' for np.ndarrays in DataFrame copies (Fortran/column-major order)
# memory_layout.order = 'K'  # keep memory layout order for np.ndarrays in DataFrame copies (order out is order in)
memory_layout.order = 'C'  # set memory layout order to 'C' for np.ndarrays in DataFrame copies (C/row-major order)


def copy(self, deep=True, mgr=None):
    """
    Copy patch on Blocks to set or keep the memory layout
    on copies.

    :param self: `pandas.core.internals.Block`
    :param deep: `bool`
    :param mgr: `BlockManager`
    :return: copy of `pandas.core.internals.Block`
    """
    values = self.values
    if deep:
        if isinstance(values, np.ndarray):
            # Block values are held transposed relative to DataFrame.values,
            # so the copy is made such that its transpose has the layout set.
            values = memory_layout.copy_transposed(values)
        else:
            values = values.copy()
    return self.make_block_same_class(values)


Block.copy = copy  # Block for pandas 0.23.4: in pandas.core.internals.Block
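The patch copies with the opposite order because a block holds its values transposed relative to what the user sees as DataFrame.values. The plain NumPy sketch below illustrates that reasoning; `frame_values` and `block_values` are hypothetical stand-ins, and the transposed storage is an assumption about pandas 0.23.4 internals that may not hold in other versions.

```python
import numpy as np

# What the user sees as DataFrame.values: C-contiguous, shape (rows, columns).
frame_values = np.arange(6).reshape(2, 3)

# Stand-in for Block.values: the transpose, hence an F-contiguous view.
block_values = frame_values.T

# A plain copy() uses C order for the block, so the user's view flips to F order.
plain = block_values.copy()
print(plain.T.flags.c_contiguous)    # layout of the user's view after a plain copy

# Copying the block in F order keeps the user's view in C order.
patched = block_values.copy(order="F")
print(patched.T.flags.c_contiguous)  # layout of the user's view after the patched copy
```

This is why a requested 'C' layout translates into an F-ordered copy of the block values, and vice versa.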

memory.py

Create file memory.py with the following content.

"""
Implements MemoryLayout copy factory to change memory layout
of `numpy.ndarrays`.
Depending on the use case, operations on DataFrames can be much
faster if the appropriate memory layout is set and preserved.

The implementation allows for changing the desired layout. Changes apply when
copies or new objects are created, as for example, when slicing or aggregating
via groupby ...

This implementation tries to solve the issue raised on GitHub
https://github.com/pandas-dev/pandas/issues/26502

"""
import numpy as np

_DEFAULT_MEMORY_LAYOUT = 'K'


class MemoryLayout(object):
    """
    Memory layout management for numpy.ndarrays.

    Singleton implementation.

    Example:
    >>> from memory import memory_layout
    >>> memory_layout.order = 'K'
    >>> # K ... keep array layout from input
    >>> # C ... set to C-contiguous / row-major order
    >>> # F ... set to F-contiguous / column-major order
    >>> array = memory_layout.apply(array)
    >>> array = memory_layout.apply(array, 'C')
    >>> array = memory_layout.copy(array)
    >>> array = memory_layout.copy_transposed(array)

    """

    _order = _DEFAULT_MEMORY_LAYOUT
    _instance = None

    @property
    def order(self):
        """
        Return memory layout ordering.

        :return: `str`
        """
        if self.__class__._order is None:
            raise AssertionError("Array layout order not set.")
        return self.__class__._order

    @order.setter
    def order(self, order):
        """
        Set memory layout order.
        Allowed values are 'C', 'F', and 'K'. Raises AssertionError
        when trying to set other values.

        :param order: `str`
        :return: `None`
        """
        assert order in ['C', 'F', 'K'], "Only 'C', 'F' and 'K' supported."
        self.__class__._order = order

    def __new__(cls):
        """
        Create only one instance throughout the lifetime of this process.

        :return: `MemoryLayout` instance as singleton
        """
        if cls._instance is None:
            cls._instance = super(MemoryLayout, cls).__new__(MemoryLayout)
        return cls._instance

    @staticmethod
    def get_from(array):
        """
        Get memory layout from array

        Possible values:
           'C' ... only C-contiguous (row-major order)
           'F' ... only F-contiguous (column-major order)
           'O' ... other: either both C- and F-contiguous
           (as on empty arrays) or neither of the two.

        :param array: `numpy.ndarray`
        :return: `str`
        """
        if array.flags.c_contiguous == array.flags.f_contiguous:
            return 'O'
        return {True: 'C', False: 'F'}[array.flags.c_contiguous]

    def apply(self, array, order=None):
        """
        Apply the order set or the order given as input on the array
        given as input.

        Possible values:
           'C' ... apply C-contiguous layout (row-major order)
           'F' ... apply F-contiguous layout (column-major order)
           'K' ... keep the given layout

        :param array: `numpy.ndarray`
        :param order: `str`
        :return: `np.ndarray`
        """
        order = self.__class__._order if order is None else order

        if order == 'K':
            return array

        array_order = MemoryLayout.get_from(array)
        if array_order == order:
            return array

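        # Convert the layout while preserving the values: reshaping a raveled
        # array with order='F' would rearrange the elements instead.
        return np.asarray(array, order=order)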

    def copy(self, array, order=None):
        """
        Return a copy of the input array with the memory layout set.
        Layout set:
           'C' ... return C-contiguous copy
           'F' ... return F-contiguous copy
           'K' ... return copy with same layout as
           given by the input array.

        :param array: `np.ndarray`
        :param order: `str`, overrides the configured order if given
        :return: `np.ndarray`
        """
        order = order if order is not None else self.__class__._order
        return array.copy(order=self.get_from(array)) if order == 'K' \
            else array.copy(order=order)

    def copy_transposed(self, array):
        """
        Return a copy of the input array in order that its transpose
        has the memory layout set.

        Note: numpy implements a transpose by swapping strides (a row-major
        array becomes a column-major view) instead of reshuffling the data
        in memory.

        Layout set:
           'C' ... return F-contiguous copy
           'F' ... return C-contiguous copy
           'K' ... return copy with the same layout as the input array,
           so that its transpose keeps its layout, too.

        :param array: `np.ndarray`
        :return: `np.ndarray`
        """
        if self.__class__._order == 'K':
            return array.copy(
                order={'C': 'C', 'F': 'F', 'O': None}[self.get_from(array)])
        else:
            return array.copy(
                order={'C': 'F', 'F': 'C'}[self.__class__._order])

    def __str__(self):
        return str(self.__class__._order)


memory_layout = MemoryLayout()  # Singleton
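A layout conversion has to preserve the values and change only how they are laid out in memory. The sketch below shows one way to do that with `np.asarray` and a layout check that mirrors `get_from` above; the helper names `detect_layout` and `with_layout` are hypothetical and not part of the modules.

```python
import numpy as np


def detect_layout(array):
    """Return 'C', 'F', or 'O' (both or neither), mirroring get_from."""
    c, f = array.flags.c_contiguous, array.flags.f_contiguous
    if c == f:
        return "O"
    return "C" if c else "F"


def with_layout(array, order):
    """Hypothetical helper: same values, requested memory layout."""
    return np.asarray(array, order=order)


a = np.arange(6).reshape(2, 3)   # C-contiguous
b = with_layout(a, "F")          # F-contiguous, values unchanged
print(detect_layout(a), detect_layout(b), (a == b).all())

# Pitfall: reshaping a flattened array with order='F' keeps the shape
# but rearranges the values, so it is not a layout conversion.
rearranged = np.reshape(np.ravel(a), a.shape, order="F")
print((rearranged == a).all())
```

Slices such as `a[:, ::2]` are neither C- nor F-contiguous and fall into the 'O' category, as do one-dimensional arrays, which are both.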
