How to query MultiIndex index columns values in pandas

Views: 3167 | Published: 2018/8/2 12:51:45 | Tags: python pandas indexing slice multi-index

Question

Code example (assumes numpy is imported as np and pandas as pd):

In [171]: A = np.array([1.1, 1.1, 3.3, 3.3, 5.5, 6.6])

In [172]: B = np.array([111, 222, 222, 333, 333, 777])

In [173]: C = np.random.randint(10, 99, 6)  # random values, so column C differs between runs

In [174]: df = pd.DataFrame(list(zip(A, B, C)), columns=['A', 'B', 'C'])

In [175]: df.set_index(['A', 'B'], inplace=True)

In [176]: df
Out[176]: 
          C
A   B      
1.1 111  20
    222  31
3.3 222  24
    333  65
5.5 333  22
6.6 777  74

Now, I want to retrieve A values:

Q1: in range [3.3, 6.6] - expected return value: [3.3, 5.5, 6.6] or [3.3, 3.3, 5.5, 6.6] if the upper bound is inclusive, and [3.3, 5.5] or [3.3, 3.3, 5.5] if it is not.

Q2: in range [2.0, 4.0] - expected return value: [3.3] or [3.3, 3.3]

The same applies to any other MultiIndex dimension, for example B values:

Q3: in range [111, 500] with repetitions, one per data row in the range - expected return value: [111, 222, 222, 333, 333]

More formally:

Let us assume T is a table with columns A, B and C and n rows. Table cells are numbers, for example A is a double and B and C are integers. Create a DataFrame from table T and name it DF. Set columns A and B as the index of DF (without duplication, i.e. A and B are not kept as data columns in addition to being index levels), so that A and B form a MultiIndex.

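Questions:

How do I write a query over the index, for example to query index A (or B) for the label interval [120.0, 540.0], when labels 120.0 and 540.0 exist? To clarify, I am interested only in the list of index values as the response to the query!

How do I do the same when labels 120.0 and 540.0 do not exist, but there are labels lower than 120, between 120 and 540, or higher than 540?

If the answer to Q1 and Q2 gives unique index values, how do I get the same result with repetitions, one per data row in the index range?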

I know the answers to the above questions for columns that are not indexes. For indexes, however, I did not succeed, even after long research on the web and experimentation with the functionality of pandas. The only method I see now (without additional programming) is to keep a duplicate of A and B as data columns in addition to the index.
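The workaround described above can be sketched as follows. This is a minimal illustration, assuming the values of column C shown in the table (they were random in the original session) and a hypothetical name df_dup. It relies on the drop parameter of set_index, which keeps the indexed columns as data when set to False.

```python
import numpy as np
import pandas as pd

A = np.array([1.1, 1.1, 3.3, 3.3, 5.5, 6.6])
B = np.array([111, 222, 222, 333, 333, 777])
C = np.array([20, 31, 24, 65, 22, 74])  # assumed fixed values, copied from the table

df_dup = pd.DataFrame({'A': A, 'B': B, 'C': C})
# drop=False keeps A and B as data columns in addition to the index levels
df_dup = df_dup.set_index(['A', 'B'], drop=False)

# Ordinary column filtering now works, at the cost of storing A and B twice
a_in_range = df_dup.loc[(df_dup['A'] >= 2.0) & (df_dup['A'] <= 4.0), 'A'].tolist()
print(a_in_range)
```

The duplication is what the question wants to avoid, so this only shows the baseline the asker is comparing against.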

Recommended answer

To query the df by the MultiIndex values, for example where (A > 1.7) and (B < 666):

In [536]: result_df = df.loc[(df.index.get_level_values('A') > 1.7) & (df.index.get_level_values('B') < 666)]

In [537]: result_df
Out[537]: 
          C
A   B      
3.3 222  43
    333  59
5.5 333  56
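As a possible alternative to building the boolean mask by hand, DataFrame.query can refer to named index levels as if they were columns. The sketch below assumes fixed values for column C (the original used random ones) and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1.1, 1.1, 3.3, 3.3, 5.5, 6.6],
     'B': [111, 222, 222, 333, 333, 777],
     'C': [20, 31, 24, 65, 22, 74]}  # assumed fixed values for illustration
).set_index(['A', 'B'])

# Named MultiIndex levels can be used directly in the query expression
result_df = df.query('A > 1.7 and B < 666')
print(result_df)
```

This only works when the index levels have names, as they do here after set_index.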

Hence, to get for example the 'A' index values, if still required:

In [538]: result_df.index.get_level_values('A')
Out[538]: Index([3.3, 3.3, 5.5], dtype=object)
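Applied to Q1 to Q3 from the question, the same idea might look like the sketch below. It assumes fixed values for column C, and the variable names (q1_incl, q2_slice and so on) are illustrative only. Because the comparisons are made on values, it does not matter whether the boundary labels exist in the index.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1.1, 1.1, 3.3, 3.3, 5.5, 6.6],
     'B': [111, 222, 222, 333, 333, 777],
     'C': [20, 31, 24, 65, 22, 74]}  # assumed fixed values for illustration
).set_index(['A', 'B'])

a = df.index.get_level_values('A')
b = df.index.get_level_values('B')

# Q1: A in [3.3, 6.6], upper bound inclusive or exclusive
q1_incl = a[(a >= 3.3) & (a <= 6.6)]          # one entry per data row
q1_incl_unique = q1_incl.unique()             # duplicates removed
q1_excl = a[(a >= 3.3) & (a < 6.6)]

# Q2: A in [2.0, 4.0]; neither 2.0 nor 4.0 is an existing label
q2 = a[(a >= 2.0) & (a <= 4.0)]

# Q3: B in [111, 500], with repetitions
q3 = b[(b >= 111) & (b <= 500)]

# On a sorted index, a label slice on the first level gives the same rows as Q2
q2_slice = df.sort_index().loc[2.0:4.0].index.get_level_values('A')

print(q1_incl.tolist(), q1_incl_unique.tolist(), q1_excl.tolist())
print(q2.tolist(), q3.tolist(), q2_slice.tolist())
```

The label slice is inclusive on both ends and requires a sorted index, while the mask approach makes the inclusive or exclusive choice explicit through the comparison operators.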

The problem is that on large data frames, selection by index performs about 10% worse than regular row selection on the sorted frame. In repetitive work such as looping, the delay accumulates. See the example:

In [558]: df = store.select(STORE_EXTENT_BURSTS_DF_KEY)

In [559]: len(df)
Out[559]: 12857

In [560]: df.sort_index(inplace=True)  # originally df.sort(inplace=True), which was removed in pandas 0.20

In [561]: df_without_index = df.reset_index()

In [562]: %timeit df.loc[(df.index.get_level_values('END_TIME') > 358200) & (df.index.get_level_values('START_TIME') < 361680)]
1000 loops, best of 3: 562 µs per loop

In [563]: %timeit df_without_index[(df_without_index.END_TIME > 358200) & (df_without_index.START_TIME < 361680)]
1000 loops, best of 3: 507 µs per loop
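For repeated queries in a loop, one option is to extract the level values once and reuse them, so that get_level_values is not called on every iteration. The sketch below uses synthetic data: the row count matches the session above, but the values, the level order, the VALUE column and the select helper are all assumptions made for illustration. It makes no claim about the resulting speed-up, which would need to be measured with %timeit on the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 12857  # same row count as the session above; the data itself is synthetic
start = np.sort(rng.integers(0, 720000, n))
end = start + rng.integers(1, 3600, n)
df = pd.DataFrame(
    {'START_TIME': start, 'END_TIME': end, 'VALUE': rng.random(n)}
).set_index(['START_TIME', 'END_TIME']).sort_index()

# Extract each level once, outside the loop, as plain NumPy arrays
start_vals = df.index.get_level_values('START_TIME').to_numpy()
end_vals = df.index.get_level_values('END_TIME').to_numpy()

def select(lo, hi):
    """Rows whose END_TIME is above lo and whose START_TIME is below hi."""
    mask = (end_vals > lo) & (start_vals < hi)
    return df[mask]

# The cached version returns the same rows as the per-call version
baseline = df.loc[(df.index.get_level_values('END_TIME') > 358200)
                  & (df.index.get_level_values('START_TIME') < 361680)]
cached = select(358200, 361680)
print(len(baseline), len(cached))
```

The cached arrays must be rebuilt whenever df is modified or re-sorted, since they are aligned with the row order at the time of extraction.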
