Efficiently add numeric columns and rows with NA and not knowing colnames

Problem description

Here is a typical data frame:

df <- data.frame(
  'ID' = c("123A","456B","789C","1011","1213")
  , 'Name' = c("Alice","Bobo","Jack","Jill","Zoro")
  , 'Quizzes' = c(13,8,14,NA,15)
  , 'Midterm' = c(13,4,16,7,12)
  , 'Final' = c(15,9,13,6,13)
)
df
    ID  Name Quizzes Midterm Final
1 123A Alice      13      13    15
2 456B  Bobo       8       4     9
3 789C  Jack      14      16    13
4 1011  Jill      NA       7     6
5 1213  Zoro      15      12    13

I would like to add up the numeric columns (excluding 'ID' and 'Name') to compute a 'Grade' column. Then I'd like to compute the mean, median, max, min, and standard deviation for each of these numeric columns. Lastly, I'd like to merge the statistics into the original data frame.

One problem is that the colnames (ID, Name, Quizzes, Midterm, Final in this example) are unknown. The number of columns is also unknown: there may be 2 identification columns (ID, Name in this example) or more, and 3 grade components (Quizzes, Midterm, Final in this example) or more.

However, I do know that the first column always contains a unique identifier.

There may be missing data and/or NA data.

When adding across columns (adding horizontally, one total per student), I'd like missing values and NAs to be treated as zero. When adding (or computing any other statistic) down the rows (adding vertically, one value per column), I'd like to ignore missing and NA values (treat them as outliers).
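The two NA policies described above can be sketched as follows. This is only a minimal illustration, assuming every numeric column is a grade component; the names `scores`, `grade`, and `class_mean` are hypothetical. A sum computed with `na.rm = TRUE` skips NAs, which gives the same result as counting them as zero.

```r
# Same sample data as in the question
df <- data.frame(
  ID      = c("123A", "456B", "789C", "1011", "1213"),
  Name    = c("Alice", "Bobo", "Jack", "Jill", "Zoro"),
  Quizzes = c(13, 8, 14, NA, 15),
  Midterm = c(13, 4, 16, 7, 12),
  Final   = c(15, 9, 13, 6, 13),
  stringsAsFactors = FALSE
)

# Select the numeric columns without referring to their names
scores <- df[sapply(df, is.numeric)]

# Per student: skipping NA in a sum is the same as counting it as zero
grade <- rowSums(scores, na.rm = TRUE)

# Per column: drop NA so a missing score does not lower the class mean
class_mean <- colMeans(scores, na.rm = TRUE)
```

Neither call modifies `df`, so the NA in the Quizzes column is still there afterwards.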

My difficulties fall into 2 categories: 1) dealing with NA and missing values, 2) merging data frames when colnames are unknown.

df$Means  = rowMeans(df[sapply(df, is.numeric)])
df
    ID  Name Quizzes Midterm Final    Means
1 123A Alice      13      13    15 13.66667
2 456B  Bobo       8       4     9  7.00000
3 789C  Jack      14      16    13 14.33333
4 1011  Jill      NA       7     6       NA
5 1213  Zoro      15      12    13 13.33333

I know how to remove NAs:

df$Means  = rowMeans(df[sapply(df, is.numeric)], na.rm = TRUE)
df
    ID  Name Quizzes Midterm Final    Means
1 123A Alice      13      13    15 13.66667
2 456B  Bobo       8       4     9  7.00000
3 789C  Jack      14      16    13 14.33333
4 1011  Jill      NA       7     6  6.50000
5 1213  Zoro      15      12    13 13.33333

But I would like to treat them as zero.

First Question: Is there a one-liner to treat NAs as zero (0) without altering the data frame?

Edit 1: Let me clarify that I know how to replace NAs with 0 in the data frame, with df[is.na(df)] <- 0, but I wish to keep the original data frame's data unchanged, keeping the NAs, while computing means with NAs treated as zero.

A bit of explanation: sapply(df, is.numeric) is intended to ignore the first two columns, whose colnames I do not know.
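One possible one-liner, offered as a sketch rather than a definitive answer: `replace()` returns a modified copy, so the NAs can be zeroed only inside the expression passed to `rowMeans()`. The names `num` and `means_zero` are hypothetical.

```r
# Same sample data as in the question
df <- data.frame(
  ID      = c("123A", "456B", "789C", "1011", "1213"),
  Name    = c("Alice", "Bobo", "Jack", "Jill", "Zoro"),
  Quizzes = c(13, 8, 14, NA, 15),
  Midterm = c(13, 4, 16, 7, 12),
  Final   = c(15, 9, 13, 6, 13),
  stringsAsFactors = FALSE
)

# Numeric columns only, selected by type rather than by name
num <- df[sapply(df, is.numeric)]

# replace() works on a copy: NA becomes 0 for this computation only
means_zero <- rowMeans(replace(num, is.na(num), 0))
```

Here Jill's mean is computed as (0 + 7 + 6) / 3 instead of NA or 6.5, and `df` still contains the original NA.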

I'd also like to merge the stats into the original data frame, for convenience of display and export to a worksheet. I got part of the way, but not very far. I tried to adapt a solution described in "Add new row to dataframe, at specific row-index, not appended?":

# create a one-row data frame of column means
data.frame(ID="Mean",t(colMeans(df[sapply(df, is.numeric)], na.rm = TRUE)))
    ID Quizzes Midterm Final
1 Mean    12.5    10.4  11.2

# add the means row to the original data frame
newRow <- data.frame(ID="Mean",t(colMeans(df[sapply(df, is.numeric)], na.rm = TRUE)))

insertRow <- function(df, r, p) {
  # df = data frame
  # r  = new row
  # p  = position
  df[seq(p+1,nrow(df)+1),] <- df[seq(p,nrow(df)),]
  df[p,] <- r
  df
} 

insertRow(df[,-1],newRow,nrow(df)+1)

    Name Quizzes Midterm Final
1  Alice    13.0    13.0  15.0
2   Bobo     8.0     4.0   9.0
3   Jack    14.0    16.0  13.0
4   Jill      NA     7.0   6.0
5   Zoro    15.0    12.0  13.0
NA  <NA>    12.5    10.4  11.2
7   <NA>      NA      NA    NA
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = 1L) :
  invalid factor level, NA generated

Second Question: How can I efficiently merge my vertical sums (and means, medians, and so on) back into the original data frame? Recall that I do not know the colnames; I only know that the first column is a unique identifier. A solution is described below.

Edit 2: I avoided using rbind because I am looking for an efficient solution. The thread "Add new row to dataframe, at specific row-index, not appended?" states: "Here's a solution that avoids the (often slow) rbind call." I do not know why rbind might be slow, but I followed that advice in trying to adapt the solution given there to my present problem.

Thanks! Please do ask for clarification if needed.

Edit 3:

The thread I cited above, "Add new row to dataframe, at specific row-index, not appended?", actually had an "efficient" solution to the problem that avoids the weird behaviour described with the insertRow function above (I hasten to add that the weird behaviour is most likely a result of my misusing the function). Here is a function that works and solves my second question:

insertRow2 <- function(df, r, p) {
  df <- rbind(df,r)
  df <- df[order(c(1:(nrow(df)-1),p-0.5)),]
  row.names(df) <- 1:nrow(df)
  return(df)  
}

insertRow2(df[,-1],newRow,nrow(df)+1)

   Name Quizzes Midterm Final
1 Alice    13.0    13.0  15.0
2  Bobo     8.0     4.0   9.0
3  Jack    14.0    16.0  13.0
4  Jill      NA     7.0   6.0
5  Zoro    15.0    12.0  13.0
6  Mean    12.5    10.4  11.2
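A variation that keeps every column, including the identifier, might look like the sketch below. It assumes only that the first column is the identifier and builds the new row from the data frame itself, so the column names always match and never need to be known; `is_num`, `new_row`, and `out` are hypothetical names. A single `rbind()` at the end is usually inexpensive; the slowness people warn about tends to come from calling it repeatedly in a loop.

```r
# Same sample data as in the question
df <- data.frame(
  ID      = c("123A", "456B", "789C", "1011", "1213"),
  Name    = c("Alice", "Bobo", "Jack", "Jill", "Zoro"),
  Quizzes = c(13, 8, 14, NA, 15),
  Midterm = c(13, 4, 16, 7, 12),
  Final   = c(15, 9, 13, 6, 13),
  stringsAsFactors = FALSE
)

is_num <- sapply(df, is.numeric)

# Start from an all-NA row that has exactly df's column names and types
new_row <- df[NA_integer_, ]

# The first column (the identifier) carries the label of the statistic
new_row[[1]] <- "Mean"

# Fill only the numeric columns; other columns such as Name stay NA
new_row[is_num] <- as.list(colMeans(df[is_num], na.rm = TRUE))

# Names match by construction, so one rbind() appends the row
out <- rbind(df, new_row)
rownames(out) <- NULL
```

The same pattern would extend to other statistics by building one such row per statistic and binding them all in one call.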

As for my first question, as no one-liner was forthcoming, I created custom functions like this:

colMeanz <- function(df) {
    df[is.na(df)] <- 0
    return(colMeans(df))
}

Rather inelegant, but there you go. Thanks to Llopis for help with this.

Extra explanation for context: when computing one student's mean, it makes sense to treat NA as zero, while when computing the whole class's mean, it makes sense to drop NAs with 'na.rm=TRUE'.

Recommended answer

Assuming that there are no names, I did this to test it:

names(df)<- NULL

First Question: To change the NA values in the data to 0, you can do df[is.na(df)] <- 0 (there are other solutions, but this should do; just search here on Stack Overflow):

df[is.na(df)] <- 0
#    NA    NA NA NA NA
#1 123A Alice 13 13 15
#2 456B  Bobo  8  4  9
#3 789C  Jack 14 16 13
#4 1011  Jill  0  7  6
#5 1213  Zoro 15 12 13

Second Question: you can just use cbind to add the new data as the last column and rbind to add a new row at the end of the df. As an example, the data below are approximately the means. I am not sure you need to worry about the time taken by rbind; with fewer than 100 rows it is fast enough.

vector <- c(14, 7, 14, 4, 13)
df <- cbind(df, vector)
#     1     2  3  4  5 vector  #Note that the name is the name of the vector
#1 123A Alice 13 13 15     14
#2 456B  Bobo  8  4  9      7
#3 789C  Jack 14 16 13     14
#4 1011  Jill  0  7  6      4
#5 1213  Zoro 15 12 13     13

To change the names you can do names(df) <- names.df, where names.df is a vector of the names you want. To compute the means, medians, and so on, you can use an apply function, but I don't know it well enough to show you how...
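For the apply-style computation mentioned above, one way it could be done is with `sapply()`, sketched here under the assumption that every numeric column should be summarised and that NAs are dropped for these per-column statistics. `col_stats` is a hypothetical helper, not part of base R.

```r
# Same sample data as in the question
df <- data.frame(
  ID      = c("123A", "456B", "789C", "1011", "1213"),
  Name    = c("Alice", "Bobo", "Jack", "Jill", "Zoro"),
  Quizzes = c(13, 8, 14, NA, 15),
  Midterm = c(13, 4, 16, 7, 12),
  Final   = c(15, 9, 13, 6, 13),
  stringsAsFactors = FALSE
)

num <- df[sapply(df, is.numeric)]

# Hypothetical helper: five summary statistics for one column, ignoring NA
col_stats <- function(x) {
  c(Mean   = mean(x, na.rm = TRUE),
    Median = median(x, na.rm = TRUE),
    Max    = max(x, na.rm = TRUE),
    Min    = min(x, na.rm = TRUE),
    SD     = sd(x, na.rm = TRUE))
}

# sapply() runs col_stats on each column and binds the results into a
# matrix: one row per statistic, one column per grade component
stats <- sapply(num, col_stats)
```

The row names of `stats` are the statistic labels and the column names are inherited from the data frame, so the original colnames never have to be typed out.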
