Remove a level from a pandas MultiIndex
Question
I want to remove a level from a MultiIndex such as this one:
import pandas as pd
tuples = [(0, 100, 1000),(0, 100, 1001),(0, 100, 1002), (1, 101, 1001)]
index_3levels=pd.MultiIndex.from_tuples(tuples,names=["l1","l2","l3"])
print(index_3levels.levels)
[Int64Index([0, 1], dtype=int64), Int64Index([100, 101], dtype=int64), Int64Index([1000, 1001, 1002], dtype=int64)]
I would like to extract the first 2 levels, to achieve:
print(index_2levels)
MultiIndex
[(0, 100), (1, 101)]
droplevel drops the level but keeps the duplicates:
print(index_3levels.droplevel("l3"))
MultiIndex
[(0, 100), (0, 100), (0, 100), (1, 101)]
In principle I could call unique to remove them, but that does not look like the right approach.
Is there a more direct method?
Recommended answer
This could be an enhancement to droplevel, maybe by passing a uniquify=True option. Until then, you can rebuild the index from the unique tuples:
In [77]: pd.MultiIndex.from_tuples(index_3levels.droplevel('l3').unique())
Out[77]:
MultiIndex
[(0, 100), (1, 101)]
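As a minimal sketch of the same idea on a recent pandas release: there, as far as I am aware, calling unique on a MultiIndex already returns a MultiIndex, so the from_tuples wrapper should not be needed and the level names are kept. The variable name index_2levels is taken from the question; the behaviour on very old pandas versions may differ.

```python
import pandas as pd

tuples = [(0, 100, 1000), (0, 100, 1001), (0, 100, 1002), (1, 101, 1001)]
index_3levels = pd.MultiIndex.from_tuples(tuples, names=["l1", "l2", "l3"])

# Drop the innermost level, then keep the first occurrence of each
# remaining (l1, l2) pair.
index_2levels = index_3levels.droplevel("l3").unique()

print(index_2levels.tolist())
print(list(index_2levels.names))
```

Index.drop_duplicates() is an alternative to unique() here if you want to choose which duplicate to keep.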
Here is another approach.
First, create some data:
In [226]: def f(i):
   .....:     return [(i,100,1000),(i,100,1001),(i,100,1002),(i+1,101,1001)]
In [227]: l = []
In [228]: for i in range(1000000):
   .....:     l.extend(f(i))
In [229]: index_3levels=pd.MultiIndex.from_tuples(l,names=["l1","l2","l3"])
In [230]: len(index_3levels)
Out[230]: 4000000
The method shown above:
In [238]: %timeit pd.MultiIndex.from_tuples(index_3levels.droplevel(level='l3').unique())
1 loops, best of 3: 2.26 s per loop
Let's split the index into its two components, l1 and l2, and uniquify each one separately. This is much faster because each component is a plain Int64Index rather than an index of tuples.
In [249]: l2 = index_3levels.droplevel(level='l3').droplevel(level='l1').unique()
In [250]: %timeit index_3levels.droplevel(level='l3').droplevel(level='l1').unique()
10 loops, best of 3: 35.3 ms per loop
In [251]: l1 = index_3levels.droplevel(level='l3').droplevel(level='l2').unique()
In [252]: %timeit index_3levels.droplevel(level='l3').droplevel(level='l2').unique()
10 loops, best of 3: 52.2 ms per loop
In [253]: len(l1)
Out[253]: 1000001
In [254]: len(l2)
Out[254]: 2
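The per-level step can also be written with get_level_values, which returns one level's values directly instead of dropping the other levels one at a time. This is a small sketch on the four-row index from the question rather than the four-million-row one, so it says nothing about timing.

```python
import pandas as pd

tuples = [(0, 100, 1000), (0, 100, 1001), (0, 100, 1002), (1, 101, 1001)]
index_3levels = pd.MultiIndex.from_tuples(tuples, names=["l1", "l2", "l3"])

# Dropping every level but one leaves a flat Index of that level's values...
via_droplevel = index_3levels.droplevel("l3").droplevel("l2")
# ...which get_level_values produces in a single call.
via_get_level_values = index_3levels.get_level_values("l1")

# Unique values per level, as in the session above.
l1 = index_3levels.get_level_values("l1").unique()
l2 = index_3levels.get_level_values("l2").unique()

print(l1.tolist(), l2.tolist())
```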
Reassemble:
In [255]: %timeit pd.MultiIndex.from_arrays([ np.repeat(l1,len(l2)), np.repeat(l2,len(l1)) ])
10 loops, best of 3: 183 ms per loop
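Total time is about 270 ms, a pretty good speedup over 2.26 s. Note that I think the ordering may be different, but some combination of np.repeat/np.tile should work. Note also that reassembling from the per-level unique values builds combinations of those values, so it reproduces the deduplicated index only when every (l1, l2) combination actually occurs in the data.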
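To make the np.repeat/np.tile remark concrete, here is a hedged sketch on the small index from the question. Pairing np.repeat on l1 with np.tile on l2 yields the full product of the two levels (the same thing pd.MultiIndex.from_product builds), which is not the same as the set of pairs that actually occur. The names product and observed are illustrative only.

```python
import numpy as np
import pandas as pd

tuples = [(0, 100, 1000), (0, 100, 1001), (0, 100, 1002), (1, 101, 1001)]
index_3levels = pd.MultiIndex.from_tuples(tuples, names=["l1", "l2", "l3"])

l1 = index_3levels.get_level_values("l1").unique().to_numpy()
l2 = index_3levels.get_level_values("l2").unique().to_numpy()

# Full product: each l1 value repeated len(l2) times, l2 tiled len(l1) times.
product = pd.MultiIndex.from_arrays(
    [np.repeat(l1, len(l2)), np.tile(l2, len(l1))], names=["l1", "l2"]
)

# Pairs that really occur in the original index.
observed = index_3levels.droplevel("l3").unique()

print(product.tolist())
print(observed.tolist())
```

On this data the product has four entries while only two pairs are observed, so the fast reassembly is a valid shortcut only when the index is known to contain every combination.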