Calculate the mean of one column from several CSV files in R

Problem description

I am new to R. I have over 300 CSV files in a folder (named 001.csv, 002.csv and so on). Each contains a data frame with a header. I am writing a function that will take three arguments: the location of the files, the name of the column you want to calculate the mean (inside the data frames) and the files you want to use in the calculation.

Here is my function:

pollutantmean2 <- function(directory = getwd(), pollutant, id = 1:332) {

    # add one or two zeros to ID so that they match the CSV file names
    filenames <- sprintf("%03d.csv", id)

    # path to specdata folder
    # if no path is provided, default is working directory
    filedir <- file.path(directory, filenames)

    # get the data from selected ID or IDs from the specified path
    dataset <- read.csv(filedir, header = TRUE)

    # calculate mean removing all NAs
    polmean <- mean(dataset$pollutant, na.rm = TRUE)

    # return mean
    polmean

}

It appears there are two things wrong with my code. To break it down, I separated the function into two separate functions to handle the two tasks: 1) get the required files and 2) calculate the mean of the desired column (aka pollutant).

1) Getting the appropriate files - It works as long as I only want one file. If I select a range of files, such as 1:25 I get an error message that says Error in file(file, "rt") : invalid 'description' argument. I have Googled this error but still have no clue how to fix it.

# function that gets csv files and stores them
getfile <- function(directory = getwd(), id) {
    filenames <- sprintf("%03d.csv", id)
    filedir <- file.path(directory, filenames)
    dataset <- read.csv(filedir, header = TRUE)
    dataset
}

If I run getfile("specdata", 1) it works fine, but if I run getfile("specdata", 1:10) I get the following error: Error in file(file, "rt") : invalid 'description' argument.

2) Calculating mean of specified named column - Assuming I have a usable data frame, I then try to calculate the mean with the following function:

calcMean <- function(dataset, pollutant) {
    polmean <- mean(dataset$pollutant, na.rm = TRUE)
    polmean
}

But if I run calcMean(mydata, "sulfate") (where mydata is a data frame I loaded manually) I get a warning instead of a result: Warning message: In mean.default(dataset$pollutant, na.rm = TRUE) : argument is not numeric or logical: returning NA

The odd thing is that if I run mean(mydata$sulfate, na.rm = TRUE) in the console, it works fine.

I would appreciate any help that points me in the right direction. I have researched this for several days and, after endless tweaking, I have run out of ideas.

Solution

You do not need more functions. As I understand the problem, a single function of about six lines of code is enough:

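library(plyr)  # provides ldply()

pollutantmean <- function(directory, pollutant, id = 1:10) {
    # build zero-padded file names such as "001.csv"
    filenames <- sprintf("%03d.csv", id)
    filenames <- paste(directory, filenames, sep = "/")
    # read each file separately; ldf is a list of data frames
    ldf <- lapply(filenames, read.csv)
    # stack the list into one data frame
    df <- ldply(ldf)
    # pollutant holds the column name as a string, so index with [, pollutant]
    mean(df[, pollutant], na.rm = TRUE)
}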
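
Both symptoms in the question follow from ordinary base R behaviour. read.csv() opens a single file, so handing it a vector of several paths triggers the "invalid 'description' argument" error. The `$` operator does not evaluate its right-hand side, so dataset$pollutant looks for a column literally named "pollutant" and returns NULL. Below is a minimal sketch of the same idea without plyr. The function name `pollutantmean_base` and the two demo files are made up for illustration and are not part of the original answer.

```r
# Sketch only: base-R variant; pollutantmean_base and the demo data are hypothetical.
pollutantmean_base <- function(directory, pollutant, id = 1:10) {
    # file.path() is vectorised, so this yields one path per id
    paths <- file.path(directory, sprintf("%03d.csv", id))
    # read.csv() handles one file at a time, so call it once per path
    frames <- lapply(paths, read.csv)
    # stack the per-file data frames row-wise into one data frame
    combined <- do.call(rbind, frames)
    # [[ ]] evaluates `pollutant`; `$` would look for a column named "pollutant"
    mean(combined[[pollutant]], na.rm = TRUE)
}

# Demo on two small made-up files written to a temporary directory
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, 2, NA)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(3, NA, 6)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

result <- pollutantmean_base(demo_dir, "sulfate", id = 1:2)
print(result)  # mean of the non-missing values 1, 2, 3, 6
```

With the real data, the call would look like pollutantmean_base("specdata", "sulfate", 1:10), assuming the files live in a folder named specdata as in the question.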
