Calculate the mean of one column from several CSV files
Question
I have over 300 CSV files in a folder (named 001.csv, 002.csv and so on). Each contains a data frame with a header. I am writing a function that will take three arguments: the location of the files, the name of the column whose mean I want to calculate (inside the data frames), and the files to use in the calculation.
Here is my function:
pollutantmean2 <- function(directory = getwd(), pollutant, id = 1:332) {
  # add one or two zeros to ID so that they match the CSV file names
  filenames <- sprintf("%03d.csv", id)
  # path to specdata folder
  # if no path is provided, default is working directory
  filedir <- file.path(directory, filenames)
  # get the data from selected ID or IDs from the specified path
  dataset <- read.csv(filedir, header = TRUE)
  # calculate mean removing all NAs
  polmean <- mean(dataset$pollutant, na.rm = TRUE)
  # return mean
  polmean
}
It appears there are two things wrong with my code. To break it down, I split the function into two separate functions to handle the two tasks: 1) get the required files and 2) calculate the mean of the desired column (aka pollutant).
Task 1: Getting the appropriate files - It works as long as I only want one file. If I select a range of files, such as 1:25, I get an error message that says Error in file(file, "rt") : invalid 'description' argument. I have Googled this error but still have no clue how to fix it.
# function that obtains csv files and stores them
getfile <- function(directory = getwd(), id) {
  filenames <- sprintf("%03d.csv", id)
  filedir <- file.path(directory, filenames)
  dataset <- read.csv(filedir, header = TRUE)
  dataset
}
If I run getfile("specdata", 1) it works fine, but if I run getfile("specdata", 1:10) I get the following error: Error in file(file, "rt") : invalid 'description' argument.
Task 2: Calculating mean of specified named column - Assuming I have a usable data frame, I then try to calculate the mean with the following function:
calcMean <- function(dataset, pollutant) {
  polmean <- mean(dataset$pollutant, na.rm = TRUE)
  polmean
}
But if I run calcMean(mydata, "sulfate") (where mydata is a data frame I loaded manually), I get a warning and NA is returned:
Warning message:
In mean.default(dataset$pollutant, na.rm = TRUE) :
argument is not numeric or logical: returning NA
The odd thing is that if I run mean(mydata$sulfate, na.rm = TRUE) in the console, it works fine.
I have researched this for several days and after endless tweaking, I have run out of ideas.
Recommended answer
You do not need more functions. As I understand it, the solution can be simpler and fits in about six lines:
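library(plyr)  # provides ldply(); install.packages("plyr") if it is missing

pollutantmean <- function(directory, pollutant, id = 1:10) {
  # build zero-padded file names such as 001.csv and prepend the directory
  filenames <- sprintf("%03d.csv", id)
  filenames <- paste(directory, filenames, sep = "/")
  # read.csv() handles one file per call, so read each file separately
  ldf <- lapply(filenames, read.csv)
  # ldf is the list of data frames; ldply() stacks them into one data frame
  df <- ldply(ldf)
  # index by the string in `pollutant`; df$pollutant would look for a column named "pollutant"
  mean(df[, pollutant], na.rm = TRUE)
}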
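For readers who want to try the idea without the original specdata folder, here is a minimal base R sketch of the same approach. Everything in it is made up for illustration: the temporary directory, the two tiny CSV files, the sulfate values, and the name pollutantmean_demo are assumptions, not part of the question or the answer. It also swaps ldply() for do.call(rbind, ...) so that no extra package is needed.

```r
# Hypothetical demo data: two small CSV files in a temporary directory
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, 2, NA)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(3, NA, 6)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

# Illustrative function (not the answer's exact code): base R only
pollutantmean_demo <- function(directory, pollutant, id) {
  paths <- file.path(directory, sprintf("%03d.csv", id))
  # read.csv() reads one file per call, so apply it to each path in turn
  frames <- lapply(paths, read.csv)
  # stack the per-file data frames into a single data frame
  combined <- do.call(rbind, frames)
  # [[ ]] uses the string stored in `pollutant`;
  # combined$pollutant would look for a column literally named "pollutant"
  mean(combined[[pollutant]], na.rm = TRUE)
}

result <- pollutantmean_demo(demo_dir, "sulfate", 1:2)
print(result)  # mean of 1, 2, 3, 6 with the NAs dropped
```

This covers both problems from the question. Passing a vector of paths to a single read.csv() call is what triggers the invalid 'description' argument error, and dataset$pollutant returns NULL when no column is literally called pollutant, which is why mean() warns and returns NA.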