What are levels in a pandas DataFrame?


Question

I've been reading through the documentation, and many explanations and examples take levels for granted. IMHO the docs lack a fundamental explanation of the data structure and its definitions.

What are levels in a data frame? What are levels in a MultiIndex index?

Recommended answer

I stumbled across this question while analyzing the answer to my own question, but I didn't find John's answer satisfying enough. After a few experiments, though, I think I understand levels and decided to share:

Short answer:

Levels are parts of the index or of the columns.

Long answer:

I think this multi-column DataFrame.groupby example illustrates index levels quite nicely.

Let's say we have report data on the time logged on issues:

import pandas as pd

# One row per work log: the issue, the minutes logged, and who logged them.
report = pd.DataFrame([
        [1, 10, 'John'],
        [1, 20, 'John'],
        [1, 30, 'Tom'],
        [1, 10, 'Bob'],
        [2, 25, 'John'],
        [2, 15, 'Bob']], columns=['IssueKey', 'TimeSpent', 'User'])

   IssueKey  TimeSpent  User
0         1         10  John
1         1         20  John
2         1         30   Tom
3         1         10   Bob
4         2         25  John
5         2         15   Bob

The index here has only 1 level (a single index value identifies every row). The index is artificial (a running number) and consists of the values from 0 to 5.
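The single-level claim can be checked directly. This is a minimal sketch that rebuilds the same `report` frame and reads the `nlevels` attribute of its index and of its columns:

```python
import pandas as pd

report = pd.DataFrame(
    [[1, 10, 'John'], [1, 20, 'John'], [1, 30, 'Tom'],
     [1, 10, 'Bob'], [2, 25, 'John'], [2, 15, 'Bob']],
    columns=['IssueKey', 'TimeSpent', 'User'])

# A freshly built DataFrame gets a default running-number index with one level.
print(report.index.nlevels)

# The columns are an index as well, and here they also have one level.
print(report.columns.nlevels)

# The artificial index values themselves: the running numbers 0..5.
print(list(report.index))
```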

Say we want to merge (sum) all logs created by the same user for the same issue, to get the total time each user spent on each issue:

time_logged_by_user = report.groupby(['IssueKey', 'User']).TimeSpent.sum()

IssueKey  User
1         Bob     10
          John    30
          Tom     30
2         Bob     15
          John    25

Now our data index has 2 levels, as multiple users logged time to the same issue. The levels are IssueKey and User. The levels are parts of the index: only together do they identify a row in a DataFrame / Series.
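To see the two levels without relying on the printed layout, the index of the grouped result can be inspected. A small sketch, reusing the example data (the variable name `idx` is only for illustration):

```python
import pandas as pd

report = pd.DataFrame(
    [[1, 10, 'John'], [1, 20, 'John'], [1, 30, 'Tom'],
     [1, 10, 'Bob'], [2, 25, 'John'], [2, 15, 'Bob']],
    columns=['IssueKey', 'TimeSpent', 'User'])

time_logged_by_user = report.groupby(['IssueKey', 'User'])['TimeSpent'].sum()

idx = time_logged_by_user.index
print(type(idx).__name__)   # the index class, a MultiIndex after a two-key groupby
print(idx.nlevels)          # how many levels the index has
print(list(idx.names))      # the level names, taken from the groupby keys

# The values of a single level, one per row of the Series.
print(idx.get_level_values('User').tolist())
```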

That the levels are parts of the index (each row's index value is a tuple such as (1, 'Bob')) can be nicely observed in the Spyder Variable explorer.
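The same tuples can be seen without Spyder. The sketch below lists the index entries and then selects rows by a full tuple, by the first level only, and by the second level only via `xs`:

```python
import pandas as pd

report = pd.DataFrame(
    [[1, 10, 'John'], [1, 20, 'John'], [1, 30, 'Tom'],
     [1, 10, 'Bob'], [2, 25, 'John'], [2, 15, 'Bob']],
    columns=['IssueKey', 'TimeSpent', 'User'])

time_logged_by_user = report.groupby(['IssueKey', 'User'])['TimeSpent'].sum()

# Every index entry is a tuple with one element per level.
index_tuples = time_logged_by_user.index.tolist()
print(index_tuples)

# A full tuple identifies exactly one row.
johns_time_on_issue_1 = time_logged_by_user.loc[(1, 'John')]
print(johns_time_on_issue_1)

# Giving only the first level returns every row of that issue.
issue_1 = time_logged_by_user.loc[1]
print(issue_1)

# xs selects on any level by name, here the second one.
bobs_time = time_logged_by_user.xs('Bob', level='User')
print(bobs_time)
```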

Having levels gives us the opportunity to aggregate values within groups with respect to an index part (level) of our choice. E.g. if we want to assign to every row the max time spent on its issue by any user, we can:

max_time_logged_to_an_issue = time_logged_by_user.groupby(level='IssueKey').transform('max')

IssueKey  User
1         Bob     30
          John    30
          Tom     30
2         Bob     25
          John    25

Now the first 3 rows have the value 30, as they correspond to issue 1 (the User level was ignored in the code above). The same goes for issue 2, where both rows have the value 25.
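For comparison, here is a short sketch of a few variations on grouping by level: referring to the level by position instead of by name, aggregating instead of transforming, and grouping by the other level. The variable names are illustrative only.

```python
import pandas as pd

report = pd.DataFrame(
    [[1, 10, 'John'], [1, 20, 'John'], [1, 30, 'Tom'],
     [1, 10, 'Bob'], [2, 25, 'John'], [2, 15, 'Bob']],
    columns=['IssueKey', 'TimeSpent', 'User'])

time_logged_by_user = report.groupby(['IssueKey', 'User'])['TimeSpent'].sum()

# transform keeps the original two-level index and repeats the group max per row.
max_by_name = time_logged_by_user.groupby(level='IssueKey').transform('max')

# A level can also be addressed by position; level 0 is IssueKey.
max_by_position = time_logged_by_user.groupby(level=0).transform('max')

# Aggregating instead of transforming collapses the result to one row per issue.
max_per_issue = time_logged_by_user.groupby(level='IssueKey').max()

# Grouping by the other level ignores IssueKey instead of User.
total_per_user = time_logged_by_user.groupby(level='User').sum()

print(max_by_name)
print(max_per_issue)
print(total_per_user)
```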

This can be useful e.g. if we want to find out which users spent the most time on each issue:

issue_owners = time_logged_by_user[time_logged_by_user == max_time_logged_to_an_issue]

IssueKey  User
1         John    30
          Tom     30
2         John    25
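Since levels are parts of the index or of the columns, a level can also be moved from one to the other. The sketch below is an addition to the answer's example, not part of it: `unstack` turns the User level into column labels, and `reset_index` turns both levels back into ordinary columns.

```python
import pandas as pd

report = pd.DataFrame(
    [[1, 10, 'John'], [1, 20, 'John'], [1, 30, 'Tom'],
     [1, 10, 'Bob'], [2, 25, 'John'], [2, 15, 'Bob']],
    columns=['IssueKey', 'TimeSpent', 'User'])

time_logged_by_user = report.groupby(['IssueKey', 'User'])['TimeSpent'].sum()

# Move the User level from the row index to the columns.
# Issue 2 has no log from Tom, so that cell becomes NaN.
wide = time_logged_by_user.unstack('User')
print(wide)

# Turn both levels back into regular columns with a fresh running-number index.
flat = time_logged_by_user.reset_index()
print(flat)
```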
