Is there an R function for dropping duplicates of an index variable based on the lowest value in another column?
Problem description
I am trying to analyse large data sets of student scores. Some students do retakes, which produce duplicate rows for the same student, usually with the earlier, lower score in the row above the retake score, which is usually higher. I want to select each student's highest score, giving a file with only one line per student (which I will then need to merge with other files that use the same IDs).
The source file looks like this:
STUDID MATRISUBJ SUBJSCORE
1032 AfrikaansB 2
1032 isiZuluB 7
1033 IsiXhosaB 6
1034 AfrikaansB 1
1034 EnglishB 4
1034 isiZuluB 3
The result should look like this:
STUDID MATRISUBJ SUBJSCORE
1032 isiZuluB 7
1033 IsiXhosaB 6
1034 EnglishB 4
Help, please. I used to do this in SPSS, but I no longer have access to that commercial software, so I am switching to R.
df2[!duplicated(df2[1:1]),]
gives the first row of each set of duplicates, but I want the row with the highest value. Sometimes a student also tries another subject to get the required score in languages.
Recommended answer
Heyo! The simplest solution would be to use the top_n() function from dplyr. It lets you select the top n rows based on a numeric column (in your case, SUBJSCORE).
The following code will give you what you need :)
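library(tidyverse)  # loads dplyr, which provides group_by(), top_n() and %>%
# df is your data frame (df2 in the question)
df %>%
  group_by(STUDID) %>%     # treat each student's rows as one group
  top_n(1, SUBJSCORE)      # keep the highest SUBJSCORE per student (tied rows are all kept)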
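One caveat worth sketching: top_n() returns every tied row, so a student whose two best scores are equal would still appear twice, which could cause trouble in the later merge. In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), which has a with_ties argument. The following is a minimal sketch under those assumptions; the names scores and best are illustrative only, and the data frame is rebuilt from the sample in the question rather than read from a file.

```r
library(dplyr)  # assumes dplyr >= 1.0.0, where slice_max() is available

# Sample data copied from the question (illustrative name: scores)
scores <- data.frame(
  STUDID    = c(1032, 1032, 1033, 1034, 1034, 1034),
  MATRISUBJ = c("AfrikaansB", "isiZuluB", "IsiXhosaB",
                "AfrikaansB", "EnglishB", "isiZuluB"),
  SUBJSCORE = c(2, 7, 6, 1, 4, 3)
)

best <- scores %>%
  group_by(STUDID) %>%
  # with_ties = FALSE returns exactly one row per student,
  # even when the top score occurs more than once
  slice_max(SUBJSCORE, n = 1, with_ties = FALSE) %>%
  ungroup()  # drop the grouping before merging with other files

print(best)
```

If two rows tie, with_ties = FALSE keeps only one of them, so sort the data beforehand if it matters which subject is kept.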