Is there an R function for dropping duplicates of index variable based on lowest value in another column?

Question

I am trying to analyse large data sets of student scores. Some students do retakes, which produces duplicate rows for the same student, usually with the earlier, lower score in the row above the retake score, which is usually higher. I want to keep only each student's highest score, so that the file has one line per student (I will need to merge it with other files that have the same IDs).
The source file looks like this:

STUDID  MATRISUBJ   SUBJSCORE
1032    AfrikaansB  2
1032    isiZuluB    7
1033    IsiXhosaB   6
1034    AfrikaansB  1
1034    EnglishB    4
1034    isiZuluB    3

The result should look like this:

STUDID  MATRISUBJ   SUBJSCORE
1032    isiZuluB    7
1033    IsiXhosaB   6
1034    EnglishB    4

Help, please. I used to do this in SPSS, but I no longer have access to that commercial software, so I am switching to R.

df2[!duplicated(df2[1:1]),]

gives the first row for each duplicated ID, but I want the row with the highest value. Sometimes a student also tries a different subject to get the required score in languages.

Recommended answer

Heyo! The simplest solution would be to use the top_n() function from dplyr. It lets you keep the top n rows within each group, ranked by a numeric column (in your case SUBJSCORE).

The following code will give you what you need :)

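  library(dplyr)  # provides %>%, group_by() and top_n()

  df2 %>%
    group_by(STUDID) %>%     # one group per student
    top_n(1, SUBJSCORE) %>%  # keep the row(s) with the highest score
    ungroup()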
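
One caveat worth noting: top_n() keeps ties, so a student whose two best scores are equal would still end up with two rows, which would get in the way of the merge. Since dplyr 1.0.0, top_n() is also marked as superseded in favour of slice_max(). A possible variant, sketched here on the sample data from the question (the names `df2` and `best` are only assumptions for illustration), might look like this:

```r
library(dplyr)

# Sample data copied from the question
df2 <- data.frame(
  STUDID    = c(1032, 1032, 1033, 1034, 1034, 1034),
  MATRISUBJ = c("AfrikaansB", "isiZuluB", "IsiXhosaB",
                "AfrikaansB", "EnglishB", "isiZuluB"),
  SUBJSCORE = c(2, 7, 6, 1, 4, 3)
)

best <- df2 %>%
  group_by(STUDID) %>%
  # with_ties = FALSE returns exactly one row per group, even if
  # a student has two equal top scores
  slice_max(SUBJSCORE, n = 1, with_ties = FALSE) %>%
  ungroup()

best
```

With `with_ties = FALSE`, which of the tied rows survives depends on row order, so if the subject matters in a tie it may be worth sorting first.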

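If you would rather stay in base R, the duplicated() attempt from the question can probably be adapted instead of replaced. The idea, as a rough sketch that again assumes the data frame is called `df2` (`sorted` and `best` are made-up names), is to sort so that each student's highest score comes first and only then drop the duplicates:

```r
# Sample data copied from the question
df2 <- data.frame(
  STUDID    = c(1032, 1032, 1033, 1034, 1034, 1034),
  MATRISUBJ = c("AfrikaansB", "isiZuluB", "IsiXhosaB",
                "AfrikaansB", "EnglishB", "isiZuluB"),
  SUBJSCORE = c(2, 7, 6, 1, 4, 3)
)

# Sort by student, then by score in descending order within each student
sorted <- df2[order(df2$STUDID, -df2$SUBJSCORE), ]

# duplicated() flags every row after the first one for each STUDID,
# so negating it keeps only the top-scoring row per student
best <- sorted[!duplicated(sorted$STUDID), ]

best
```

Because duplicated() only ever keeps the first occurrence, this always returns one row per STUDID, which is what the later merge needs.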