Pandas: how to use slicing for mixed-type multi-indices in Python 3?


Problem description

As I noted in a partially related question, Python 3 no longer allows sorting mixed-type sequences:

# Python3.6
sorted(['foo', 'bar', 10, 200, 3])
# => TypeError: '<' not supported between instances of 'str' and 'int'
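
For a plain Python sequence, one way around this error is a sort key that groups values by type before comparing them. The following is a minimal sketch: the helper name `mixed_key` is made up for illustration, and putting numbers before strings simply mirrors the Python 2 ordering this question relies on. It does not by itself change how pandas sorts an index.

```python
def mixed_key(value):
    # Hypothetical helper: sort non-strings before strings, then by value
    # within each group, so that str and int are never compared directly.
    return (isinstance(value, str), value)

result = sorted(['foo', 'bar', 10, 200, 3], key=mixed_key)
print(result)  # [3, 10, 200, 'bar', 'foo']
```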

This affects slicing queries in pandas. The following example illustrates the problem.

import pandas as pd
import numpy as np
index = [(10,3),(10,1),(2,2),('foo',4),('bar',5)]
index = pd.MultiIndex.from_tuples(index)
data = np.random.randn(len(index),2)
table = pd.DataFrame(data=data, index=index)

idx=pd.IndexSlice
table.loc[idx[:10,:],:]
# The last line will raise an UnsortedIndexError because 
# 'foo' and 'bar' appear in the wrong order.

The exception message is:

UnsortedIndexError: 'MultiIndex slicing requires the index to be lexsorted: slicing on levels [0], lexsort depth 0'

In Python 2.x, I recovered from this exception by lex-sorting the index:

# Python2.x:
table = table.sort_index()

#               0         1
# 2   2  0.020841  0.717178
# 10  1  1.608883  0.807834
#     3  0.566967  1.978718
# bar 5 -0.683814 -0.382024
# foo 4  0.150284 -0.750709

table.loc[idx[:10,:],:]
#              0         1
# 2  2  0.020841  0.717178
# 10 1  1.608883  0.807834
#    3  0.566967  1.978718

In Python 3, however, I end up with the exception mentioned at the beginning:

TypeError: '<' not supported between instances of 'str' and 'int'

How can I recover from this? Converting the index to strings before sorting is not an option, because it breaks the proper ordering of the index:

# Python2/3
index = [(10,3),(10,1),(2,2),('foo',4),('bar',5)]
index = list(map(lambda x: tuple(map(str,x)), index))
index = pd.MultiIndex.from_tuples(index)
data = np.random.randn(len(index),2)
table = pd.DataFrame(data=data, index=index)
table = table.sort_index()
#               0         1
# 10  1  0.020841  0.717178
#     3  1.608883  0.807834
# 2   2  0.566967  1.978718
# bar 5 -0.683814 -0.382024
# foo 4  0.150284 -0.750709

With this ordering, value-based slicing is broken:

table.loc[idx[:10,:],:]     # Raises a TypeError
table.loc[idx[:'10',:],:]   # Fails to return the rows with index [2,:]
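
The second query misses the rows for 2 because strings compare character by character, so '10' sorts before '2'. A small plain-Python illustration of that effect follows; no pandas is involved, and the list `level` merely stands in for the stringified first index level.

```python
level = ['10', '2', 'bar', 'foo']   # stringified first level, in sorted order

# Strings compare character by character, so '10' < '2'.
print(sorted(level))                # ['10', '2', 'bar', 'foo']

# An upper bound of '10' therefore excludes '2'.
selected = [label for label in level if label <= '10']
print(selected)                     # ['10']

# Zero-padding restores numeric order among the digit strings.
padded = sorted('%03d' % n for n in (10, 2))
print(padded)                       # ['002', '010']
```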

How can I recover from this?

Recommended answer

This is a second solution I came up with. It is nicer than my previous suggestion in that it does not alter the index values of the lex-sorted table. Here, I temporarily convert the non-string index values to strings before sorting the table, and convert them back after sorting.

The solution works because pandas can handle mixed-type indices natively. It appears that only the string-based subset of the index needs to be lex-sorted. (Internally, pandas uses a so-called Categorical object, which appears to distinguish between strings and other types on its own.)

import numpy as np
import pandas as pd

def stringifiedSortIndex(table):
    # Work on a copy so the caller's table keeps its original index.
    table = table.copy()
    # 1) Stringify the index.
    _stringifyIdx = _StringifyIdx()
    table.index = table.index.map(_stringifyIdx)
    # 2) Sort the index.
    table = table.sort_index()
    # 3) Destringify the sorted table.
    _stringifyIdx.revert = True
    table.index = table.index.map(_stringifyIdx)
    # Return the sorted table.
    return table

class _StringifyIdx(object):
    def __init__(self):
        self._destringifyMap = dict()
        self.revert = False
    def __call__(self, idx):
        if not self.revert:
            return self._stringifyIdx(idx)
        else:
            return self._destringifyIdx(idx)

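    # Stringify whatever needs to be converted.
    # In this example: only ints are stringified.
    # Note: zero-padding to three digits preserves numeric order only
    # for integers from 0 to 999; widen the format for larger values.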
    @staticmethod
    def _stringify(x):
        if isinstance(x,int):
            x = '%03d' % x
            destringify = int
        else:
            destringify = lambda x: x
        return x, destringify

    def _stringifyIdx(self, idx):
        if isinstance(idx, tuple):
            idx = list(idx)
            destr = [None]*len(idx)
            for i,x in enumerate(idx):
                idx[i], destr[i] = self._stringify(x)
            idx = tuple(idx)
            destr = tuple(destr)
        else:
            idx, destr = self._stringify(idx)
        if self._destringifyMap is not None:
            self._destringifyMap[idx] = destr
        return idx

    def _destringifyIdx(self, idx):
        if idx not in self._destringifyMap:
            raise ValueError(("Index to destringify has not been stringified "
                              "by this class instance. Index must not change "
                              "between stringification and destringification."))
        destr = self._destringifyMap[idx]
        if isinstance(idx, tuple):
            assert(len(destr)==len(idx))
            idx = tuple(d(i) for d,i in zip(destr, idx))
        else:
            idx = destr(idx)
        return idx


# Build the table.
index = [(10,3),(10,1),(2,2),('foo',4),('bar',5)]
index = pd.MultiIndex.from_tuples(index)
data = np.random.randn(len(index),2)
table = pd.DataFrame(data=data, index=index)
idx = pd.IndexSlice

table = stringifiedSortIndex(table)
print(table)

# Now, the table rows can be accessed as usual.
table.loc[idx[10],:]
table.loc[idx[:10],:]
table.loc[idx[:'bar',:],:]
table.loc[idx[:,:2],:]

# This also works for a table with a simple (single-level) index.
table = pd.DataFrame(data=data, index=[4,1,'foo',3,'bar'])
table = stringifiedSortIndex(table)
table[:'bar']
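
Stripped of the pandas machinery, the stringify, sort, destringify round trip can be sketched on a plain list of tuples. This is only an illustration of the idea, not the code above: `sort_mixed_tuples` and its `width` parameter are hypothetical names, and the padding assumes non-negative integers with at most `width` digits.

```python
def sort_mixed_tuples(tuples, width=3):
    """Sort tuples of ints and strings by their zero-padded string form."""
    def encode(value):
        # Assumption: ints are non-negative and have at most `width` digits.
        return str(value).zfill(width) if isinstance(value, int) else value

    # Map each stringified tuple back to its original, then sort the keys.
    # (Identical tuples collapse into one entry in this simple sketch.)
    originals = {tuple(encode(v) for v in t): t for t in tuples}
    return [originals[key] for key in sorted(originals)]

index = [(10, 3), (10, 1), (2, 2), ('foo', 4), ('bar', 5)]
ordered = sort_mixed_tuples(index)
print(ordered)
# [(2, 2), (10, 1), (10, 3), ('bar', 5), ('foo', 4)]
```

The result matches the row order that `sort_index()` produced under Python 2 in the question.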
