Looping through data to set values > or < variable as NA in R


Problem description

I have a data frame containing columns with integers, characters, and numerics. The actual data set is much larger than the example given below, but what is below is a passable and much smaller imitation.

I am trying to loop through the data and change any value greater than the mean + (3 * standard deviation) or less than the mean - (3 * standard deviation) to NA, in the numeric columns only. If a column contains integers or characters, the loop should skip it and continue to the next column. Additionally, most columns already contain some NA values and will have lots of values that fall within the mean +/- (3*sd). Those values need to remain as they are.

The ultimate goal of this script is to use it on future data sets with the same structure and while I am open to suggestions with packages, I would like to use loops if possible. However, I am far from an expert in R and will happily take any and all advice anyone has for me!

I have worked out a structure for the overall script, but it stops after the first next statement.

Script:

data = data.frame(test_data)

for (i in colnames(data)){
  if (class(data$i) == "numeric"){
    m = mean(data$i, na.rm=TRUE)
    sd = sd(data$i, na.rm=TRUE)
  }
    else
      next
  for (j in 1:nrow(data)){
    if (data$i[j,] > (m + 3*sd)){
      data$i[j,] <- NA
    }
    else if (data$i[j,] < (m - 3*sd)){
      data$i[j,] <- NA
    }
    else 
      next
    }
}

The data being used to test this script is as follows:

Trait1 = c(1.1, 1.2, 1.35, 1.1, 1.2, NA, 1000, 1.5, 1.4, 1.6)
Trait2 = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
Trait3 = c(125.1, 119.3, 118.4, NA, 1.1, 122.3, 123.4, 125.7, 121.5, 121.7)
test_data = data.frame(Trait1, Trait2, Trait3)

Thank you in advance for any help you have to offer, I greatly appreciate it!

Recommended answer

Using dplyr and converting the numeric variables to a z-score using scale(), this can be simplified to:

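library(dplyr)

# scale() centers each column and divides by its sd, giving z-scores;
# entries with |z| > 3 become NA, and existing NAs are left untouched.
# Note: is.numeric is also TRUE for integer columns.
test_data %>% 
  mutate_if(is.numeric, ~replace(.x, abs(scale(.x)) > 3, NA))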
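
Since the question asks for loops, here is a minimal base R sketch of the same idea. It is not part of the original answer. The function name `mask_outliers` and its cutoff argument `k` are hypothetical names chosen for illustration. The sketch assumes the original script stalls because `data$i` looks for a column literally named "i" (so every column falls through to `next`), and it therefore uses `data[[col]]` instead. It also checks for existing NA values before comparing, because `if (NA > ...)` is an error.

```r
# Hypothetical helper (not from the original answer): set values lying more
# than k standard deviations from the column mean to NA, using plain loops.
mask_outliers <- function(data, k = 3) {
  for (col in colnames(data)) {
    x <- data[[col]]                  # [[ ]] uses the value of col; data$col would not
    if (!identical(class(x), "numeric")) next   # skip integer, character, factor columns
    m <- mean(x, na.rm = TRUE)
    s <- sd(x, na.rm = TRUE)
    if (is.na(s)) next                # fewer than two non-missing values
    for (j in seq_along(x)) {
      # test for NA first so that existing missing values are left alone
      if (!is.na(x[j]) && abs(x[j] - m) > k * s) {
        x[j] <- NA
      }
    }
    data[[col]] <- x                  # write the cleaned column back
  }
  data
}

# The test data from the question
Trait1 <- c(1.1, 1.2, 1.35, 1.1, 1.2, NA, 1000, 1.5, 1.4, 1.6)
Trait2 <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
Trait3 <- c(125.1, 119.3, 118.4, NA, 1.1, 122.3, 123.4, 125.7, 121.5, 121.7)
test_data <- data.frame(Trait1, Trait2, Trait3)

cleaned <- mask_outliers(test_data)         # default cutoff of 3 sd
loose   <- mask_outliers(test_data, k = 2)  # looser cutoff, for comparison
```

One thing to be aware of with the sample data: each numeric column has only 9 non-missing values, and with n values no z-score can exceed (n - 1) / sqrt(n), which is 8/3 (about 2.67) here. So at the 3 sd cutoff neither 1000 nor 1.1 is flagged, by this loop or by the `scale()` approach. They only become NA with a smaller cutoff such as `k = 2`. On the larger real data set the 3 sd rule should behave as expected.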
