Correlation between NA columns
Problem description
I have to write a function that takes a directory of data files and a threshold for complete cases, and calculates the correlation between sulfate and nitrate (two columns) for each file in which the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no files meet the threshold requirement, the function should return a numeric vector of length 0. A prototype of this function follows.
My code looks like this:
corr <- function(directory, threshold = 0) {
  a <- list.files("specdata")
  for (i in a) {
    data <- read.csv(paste(directory, "/", i, sep = ""))
    x <- complete.cases(data)
    j <- sum(as.numeric(x))
    sulfate <- data[, 2]
    nitrate <- data[, 3]
    b <- cor(sulfate, nitrate)
  }
  if (j > threshold)
    return(b)
  else
    numeric()
}
There is no error message. If I type:
z <- corr("specdata")
head(z)
[1] NA
I don't know what the problem is. I don't know whether the NA values in the columns have anything to do with it. I think something is missing from my code. I think read.csv creates a single data frame when I need one data frame per file, but I don't see why the result is NA in this case (when there is no threshold).
However, if I introduce a bigger threshold (1000):
z<-corr("specdata",1000)
head(z)
numeric(0)
The expected output I need is:
cr <- corr("specdata", 150)
head(cr)
[1] -0.01895754 -0.14051254 -0.04389737 -0.06815956 -0.12350667 -0.07588814
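Both results are consistent with the code as posted: `b` and `j` are overwritten on every pass through the loop, so only the last file read decides what is returned, and `cor` with its default `use = "everything"` returns NA whenever either input contains an NA. A minimal sketch of the second point, using made-up vectors (assumptions for illustration, not values from specdata):

```r
# Made-up vectors standing in for the sulfate and nitrate columns
sulfate <- c(1, 2, NA, 4)
nitrate <- c(2, 4, 6, 8)

# Default use = "everything": an NA in either input propagates to the result
r_default <- cor(sulfate, nitrate)

# Option 1: keep only the rows where both values are present
ok <- complete.cases(sulfate, nitrate)
r_complete <- cor(sulfate[ok], nitrate[ok])

# Option 2: let cor() drop the incomplete pairs itself
r_use <- cor(sulfate, nitrate, use = "complete.obs")

print(r_default)
print(r_complete)
print(r_use)
```

Filtering with `complete.cases` before calling `cor` is the approach the solution below takes.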
This is a correct, working solution that you can refer to:
corr <- function(directory, threshold = 0) {
  ## 'directory' is a character vector of length 1 indicating the location of
  ## the CSV files
  ## 'threshold' is a numeric vector of length 1 indicating the number of
  ## completely observed observations (on all variables) required to compute
  ## the correlation between nitrate and sulfate; the default is 0
  ## Return a numeric vector of correlations
  df = complete(directory)
  ids = df[df["nobs"] > threshold, ]$id
  corrr = numeric()
  for (i in ids) {
    newRead = read.csv(paste(directory, "/", formatC(i, width = 3, flag = "0"),
                             ".csv", sep = ""))
    dff = newRead[complete.cases(newRead), ]
    corrr = c(corrr, cor(dff$sulfate, dff$nitrate))
  }
  return(corrr)
}
complete <- function(directory, id = 1:332) {
  f <- function(i) {
    data = read.csv(paste(directory, "/", formatC(i, width = 3, flag = "0"),
                          ".csv", sep = ""))
    sum(complete.cases(data))
  }
  nobs = sapply(id, f)
  return(data.frame(id, nobs))
}
cr <- corr("specdata", 150)
head(cr)
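To try the same idea without the specdata files, here is a minimal self-contained sketch. It is not the solution above: `corr_sketch`, `write_toy`, and the three toy CSV files are hypothetical names and data made up for illustration, and the sketch takes the file names from `list.files` and reads each file once instead of calling `complete` first.

```r
# Hypothetical toy data: three small CSV files in a fresh temporary directory
toy_dir <- tempfile("toy_specdata")
dir.create(toy_dir)

# Helper (an assumption, not part of the original code) that writes one file
write_toy <- function(name, sulfate, nitrate) {
  df <- data.frame(Date = "2020-01-01", sulfate = sulfate, nitrate = nitrate)
  write.csv(df, file.path(toy_dir, name), row.names = FALSE)
}
write_toy("001.csv", c(1, 2, 3, NA), c(2, 4, 6, 8))  # 3 complete rows
write_toy("002.csv", c(1, 2, 3, 4), c(4, 3, 2, 1))   # 4 complete rows
write_toy("003.csv", c(1, NA), c(NA, 2))             # 0 complete rows

corr_sketch <- function(directory, threshold = 0) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  out <- numeric()
  for (f in files) {
    d <- read.csv(f)
    d <- d[complete.cases(d), ]    # drop rows with any NA
    if (nrow(d) > threshold) {     # per-file check, inside the loop
      out <- c(out, cor(d$sulfate, d$nitrate))
    }
  }
  out                              # numeric(0) if no file qualifies
}

res_all  <- corr_sketch(toy_dir)      # files 001 and 002 qualify
res_some <- corr_sketch(toy_dir, 3)   # only file 002 qualifies
res_none <- corr_sketch(toy_dir, 10)  # no file qualifies
print(res_all)
print(res_some)
print(res_none)
```

The key differences from the code in the question are that the threshold test happens inside the loop, once per file, and that each file's correlation is appended to a result vector rather than overwriting a single value.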