大 pandas ：前N排，每组前N行，等同于ROW_NUMBER OVER（PARTITION BY ... ORDER BY ...） [英] Pandas: top N rows, top N rows per group, equivalent for ROW_NUMBER OVER(PARTITION BY ... ORDER BY ...)

查看：251 发布时间：2017/3/26 0:24:33 python pandas dataframe

本文介绍了大 pandas ：前N排，每组前N行，等同于ROW_NUMBER OVER（PARTITION BY ... ORDER BY ...）的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

在T-SQL中，什么是相当于TOP函数的python？我正在将我的数据框过滤到前50k行。我已经在线查看，我找不到一个简单的例子。

What is the python equivalent to the TOP function in T-SQL? I'm looking to filter my dataframe to the top 50K rows. I've looked online and I can't find a simple example.

推荐答案

更新：显示不同的大熊猫方法，其中包括：

UPDATE: - shows different pandas approaches, including:

每组顶部N行

顶部带有偏移量的N行

相当于SQL聚合函数：

equivalent for SQL aggregate functions:

ROW_NUMBER（）/ RANK（）OVER（PARTITION BY ... ORDER BY ...）

样本DF：

df = pd.DataFrame({
  'dep': np.random.choice(list('ABC'), 20),
  'manager_id': np.random.randint(0, 10, 20),
  'salary': np.random.randint(5000, 5006, 20)
})

-----------------------原始DF ---------- --------------

----------------------- Original DF ------------------------

In [2]: df
Out[2]:
   dep  manager_id  salary
0    B           5    5005
1    A           6    5001
2    C           8    5000
3    A           7    5000
4    B           0    5002
5    A           3    5003
6    A           2    5004
7    A           2    5004
8    C           3    5002
9    C           4    5001
10   A           9    5002
11   C           9    5000
12   B           8    5004
13   A           1    5003
14   C           7    5005
15   B           0    5002
16   B           2    5003
17   A           4    5000
18   B           2    5003
19   B           7    5003

- ----------------前5行（按原始索引排序）-------------------

------------------ top 5 rows (sorted by original index) -------------------

In [3]: df.head(5)
Out[3]:
  dep  manager_id  salary
0   B           5    5005
1   A           6    5001
2   C           8    5000
3   A           7    5000
4   B           0    5002

---前5行（按排序）manager_id DESC， dep ASC）----

--- top 5 rows (sorted by manager_id DESC, dep ASC) ----

In [4]: df.sort_values(by=['manager_id', 'dep'], ascending=[False,True]).head(5)
Out[4]:
   dep  manager_id  salary
10   A           9    5002
11   C           9    5000
12   B           8    5004
2    C           8    5000
3    A           7    5000

---等价于 SELECT * FROM tab ORDER BY salary DESC LIMIT 5 OFFSET 3 ---

--- equivalent for SELECT * FROM tab ORDER BY salary DESC LIMIT 5 OFFSET 3 ---

In [19]: df.nlargest(5+3, columns=['salary']).tail(5)
Out[19]:
   dep  manager_id  salary
7    A           2    5004
12   B           8    5004
5    A           3    5003
13   A           1    5003
16   B           2    5003

----每个部门的前2名工资（不重复）-----

---- top 2 salaries in each department (no duplicates) -----

---相当于SQL： row_number（）over（分区按DEP顺序由SALARY desc） ---

--- equivalent for SQL: row_number() over(partition by DEP order by SALARY desc) ---

In [7]: (df.assign(rn=df.sort_values(['salary'], ascending=False)
   ...:                 .groupby(['dep'])
   ...:                 .cumcount() + 1)
   ...:    .query('rn < 3')
   ...:    .sort_values(['dep','rn'])
   ...: )
Out[7]:
   dep  manager_id  salary  rn
6    A           2    5004   1
7    A           2    5004   2
0    B           5    5005   1
12   B           8    5004   2
14   C           7    5005   1
8    C           3    5002   2

---每个部门的前2名工资（使用最大） - -

--- top 2 salaries in each department (using "nlargest") ----

In [15]: df.ix[df.groupby('dep')['salary'].nlargest(2).reset_index()['level_1']]
Out[15]:
   dep  manager_id  salary
6    A           2    5004
7    A           2    5004
0    B           5    5005
12   B           8    5004
14   C           7    5005
8    C           3    5002

---每个部门的第二和第三最高工资---

--- second and third highest salaries in each department ---

In [16]: (df.assign(rn=df.sort_values(['salary'], ascending=False)
   ....:                 .groupby(['dep'])
   ....:                 .cumcount() + 1)
   ....:    .query('rn >= 2 and rn <= 3')
   ....:    .sort_values(['dep','rn'])
   ....: )
Out[16]:
   dep  manager_id  salary  rn
7    A           2    5004   2
13   A           1    5003   3
12   B           8    5004   2
18   B           2    5003   3
8    C           3    5002   2
9    C           4    5001   3

---每个部门的前2名工资（重复）----

--- top 2 salaries in each department (with duplicates) ----

---相当于SQL： rank（）over（分区按DEP顺序由SALARY desc） -

--- equivalent for SQL: rank() over(partition by DEP order by SALARY desc) ---

In [18]: (df.assign(rnk=df.groupby(['dep'])['salary']
   ....:                  .rank(method='min', ascending=False))
   ....:    .query('rnk < 3')
   ....:    .sort_values(['dep','rnk'])
   ....: )
Out[18]:
   dep  manager_id  salary  rnk
6    A           2    5004  1.0
7    A           2    5004  1.0
0    B           5    5005  1.0
12   B           8    5004  2.0
14   C           7    5005  1.0
8    C           3    5002  2.0

这篇关于大 pandas ：前N排，每组前N行，等同于ROW_NUMBER OVER（PARTITION BY ... ORDER BY ...）的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

大 pandas ：前N排，每组前N行，等同于ROW_NUMBER OVER（PARTITION BY ... ORDER BY ...） [英] Pandas: top N rows, top N rows per group, equivalent for ROW_NUMBER OVER(PARTITION BY ... ORDER BY ...)

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

大 pandas ：前N排，每组前N行，等同于ROW_NUMBER OVER（PARTITION BY ... ORDER BY ...） [英] Pandas: top N rows, top N rows per group, equivalent for ROW_NUMBER OVER(PARTITION BY ... ORDER BY ...)

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭