Sort a dataframe column by the frequency of occurrence

Question

I have a data frame called df with, let's say, three columns:

Region ID  Salary
1      A1  100
1      A2  1001
1      A3  2000
1      A4  2431
1      A5  1001
..............
..............
2      A6  1002
2      A7  1002
2      A8  1002
3      A9  3001
3      A10 3001
3      A11 4001

Now I want to sort the Salary column by how often each value occurs within its Region: that is, build a frequency table (or something similar), get the probability of occurrence per region, and sort by it. Please assume that the dataset is large enough (1000 rows).

P.S.: Can anyone suggest a good way to do this? Please use column names in your answers, since the real table has some additional columns in between.

Thanks in advance.

**EDIT 1**

I think I was not clear enough. Thanks to everyone who replied; I sincerely apologise for the confusion:

With the current dataset, we need to create a frequency table, say:

Region  Salary(bin)     Count
1       1K              6                   
1       5K              3                   
1       2K              2                   
1       15K             2                   
1       0.5K            2                   
1       24K             1                   
1       0K              0                   

Using this classification, we can add a new column called bin (the bucket from the histogram) to our data frame df:

Region     ID  Salary  (bin)   Count
    1      A1  100     1K      6
    1      A2  1001    2K      2
    1      A3  2000    2K      2
    1      A4  2431    5K      3

... and so on ...

The bin column can be created using:

df$bin <- cut(df$Salary, breaks = hist(df$Salary, plot = FALSE)$breaks)
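As a minimal sketch of the binning and per-region counting described above, the following base R code uses only the rows visible in the question's table (the elided rows are left out). The `Count` column name follows the frequency table above; `include.lowest = TRUE` and the use of `ave()` for the per-group count are assumptions of this sketch, not something stated in the question.

```r
# Sample rows copied from the question's table (elided rows omitted)
df <- data.frame(
  Region = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  ID     = c("A1", "A2", "A3", "A4", "A5", "A6",
             "A7", "A8", "A9", "A10", "A11"),
  Salary = c(100, 1001, 2000, 2431, 1001, 1002,
             1002, 1002, 3001, 3001, 4001)
)

# Bin salaries at the break points hist() would choose, without drawing a plot
breaks <- hist(df$Salary, plot = FALSE)$breaks
df$bin <- cut(df$Salary, breaks = breaks, include.lowest = TRUE)

# Count: number of rows that share the same Region and bin
df$Count <- ave(seq_along(df$Salary), df$Region, df$bin, FUN = length)

# Sort by Region, then descending Count, then Salary
sorted <- df[order(df$Region, -df$Count, df$Salary), ]
print(sorted)
```

The bin labels produced by `cut()` are interval labels chosen by R, so they will not match the "1K", "2K" style labels shown in the tables here unless a `labels` argument is supplied.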

After sorting by Region, Count, and Salary, we get:

Region     ID  Salary  (bin)   Count
    1      A1  100     1K      6
    1      A4  2431    5K      3
    1      A3  2000    2K      2
    1      A2  1001    2K      2

As you can see, we need to create a frequency table for each region and then sort by it. I did the above using Tableau, but I want to automate it in R.

I hope this is clear.

Recommended answer

One possible approach is to use data.table to add a freq column and then sort your data accordingly:

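library(data.table)

# Add freq: the number of rows sharing the same Region and Salary
setDT(df)[, freq := .N, by = c("Region", "Salary")]

# Sort by descending frequency
df[order(freq, decreasing = TRUE), ]

# As a one-liner (thanks @Jaap)
setDT(df)[, freq := .N, by = .(Region, Salary)][order(-freq)]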
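The code above orders by frequency across the whole table. If the regions should stay together, as the question's expected output suggests, one possible variation is to put `Region` first in the ordering. The sketch below applies this to the rows visible in the question; the `prob` column (frequency as a share of the region's rows) and the tie-break on `Salary` are assumptions added for illustration.

```r
library(data.table)

# Sample rows copied from the question's table (elided rows omitted)
dt <- data.table(
  Region = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  ID     = c("A1", "A2", "A3", "A4", "A5", "A6",
             "A7", "A8", "A9", "A10", "A11"),
  Salary = c(100, 1001, 2000, 2431, 1001, 1002,
             1002, 1002, 3001, 3001, 4001)
)

# freq: how often each Salary value occurs within its Region
dt[, freq := .N, by = .(Region, Salary)]

# prob: freq as a share of all rows in that Region (illustrative column)
dt[, prob := freq / .N, by = Region]

# Keep regions together; most frequent salaries first within each region
result <- dt[order(Region, -freq, Salary)]
print(result)
```

Because columns are referred to by name throughout, any extra columns in the real table are carried along unchanged.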