Sort a dataframe column by the frequency of occurrence
Question
I have a data frame called df with, let's say, three columns:
Region ID Salary
1 A1 100
1 A2 1001
1 A3 2000
1 A4 2431
1 A5 1001
..............
..............
2 A6 1002
2 A7 1002
2 A8 1002
3 A9 3001
3 A10 3001
3 A11 4001
Now I want to sort the Salary column by frequency of occurrence within each Region; that is, using a frequency table or something similar, get the probability of occurrence per region and sort by it. Please assume that the dataset is large enough (1000 rows).
P.S.: Can anyone suggest a good method to do so? Please use column names in your answers, since the real table has some other columns in between.
Thanks in advance.
**EDIT 1**
I think I was not clear enough. Thanks to all those who replied; I sincerely apologise for not being clear:
With the current dataset, we need to create a frequency table, say:
Region Salary(bin) Count
1 1K 6
1 5K 3
1 2K 2
1 15K 2
1 0.5K 2
1 24K 1
1 0K 0
Using this, we can classify each row and add a new column called bin (the bucket from the histogram) to our data frame df:
Region ID Salary (bin) Count
1 A1 100 1K 6
1 A2 1001 2K 2
1 A3 2000 2K 2
1 A4 2431 5K 3
... and so on ...
We can do the above using:
df$bin <- cut(df$Salary, breaks = hist(df$Salary, plot = FALSE)$breaks, include.lowest = TRUE)
After sorting by Region, Count and Salary, we get:
Region ID Salary (bin) Count
1 A1 100 1K 6
1 A4 2431 5K 3
1 A3 2000 2K 2
1 A2 1001 2K 2
As you can see, we need to create a frequency table for each region and then sort. I did the above using Tableau, but I want to automate this in R.
I hope that is clear.
Recommended answer
One possible approach is to use data.table to add a freq column, then sort your data accordingly:
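library(data.table)
# Convert df to a data.table by reference and count rows per Region/Salary pair
setDT(df)[, freq := .N, by = c("Region", "Salary")]
# Sort by descending frequency
df[order(freq, decreasing = TRUE), ]
# As a one-liner (thx @Jaap)
setDT(df)[, freq := .N, by = .(Region, Salary)][order(-freq)]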
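The code above counts exact Salary values and sorts across the whole table. For the binned, per-Region ordering described in the question's edit, a minimal sketch might look like the following. The toy values, the column names bin and Count, and the use of hist()'s default breaks are assumptions for illustration, not part of the original answer.

```r
library(data.table)

# Toy data modelled on the question (values are illustrative only)
df <- data.frame(
  Region = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  ID     = paste0("A", 1:11),
  Salary = c(100, 1001, 2000, 2431, 1001, 1002, 1002, 1002, 3001, 3001, 4001)
)
setDT(df)

# Bin salaries using the breaks that hist() would choose by default;
# include.lowest = TRUE keeps a value equal to the lowest break from becoming NA
brks <- hist(df$Salary, plot = FALSE)$breaks
df[, bin := cut(Salary, breaks = brks, include.lowest = TRUE)]

# Count rows per Region/bin pair
df[, Count := .N, by = .(Region, bin)]

# Sort by Region, then descending Count, then Salary
sorted <- df[order(Region, -Count, Salary)]
print(sorted)
```

Because Region comes first in order(), the most frequent bins are listed first within each region rather than across the whole table. Swapping bin for Salary in the by clause gives the same ordering on exact salary values.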