Reading multiple files and calculating mean based on user input


Problem description



I am trying to write a function in R which takes 3 inputs:

  1. directory
  2. pollutant
  3. id

I have a directory on my computer containing over 300 CSV files. What the function should do is shown in the prototype below:

pollutantmean <- function(directory, pollutant, id = 1:332) {
        ## 'directory' is a character vector of length 1 indicating
        ## the location of the CSV files

        ## 'pollutant' is a character vector of length 1 indicating
        ## the name of the pollutant for which we will calculate the
        ## mean; either "sulfate" or "nitrate".

        ## 'id' is an integer vector indicating the monitor ID numbers
        ## to be used

        ## Return the mean of the pollutant across all monitors list
        ## in the 'id' vector (ignoring NA values)
        }

An example output of this function is shown here:

source("pollutantmean.R")
pollutantmean("specdata", "sulfate", 1:10)

## [1] 4.064

pollutantmean("specdata", "nitrate", 70:72)

## [1] 1.706

pollutantmean("specdata", "nitrate", 23)

## [1] 1.281

I can read all the files in one go with:

path = "C:/Users/Sean/Documents/R Projects/Data/specdata"
fileList = list.files(path=path,pattern="\\.csv$",full.names=TRUE)
all.files.data = lapply(fileList,read.csv,header=TRUE)
DATA = do.call("rbind",all.files.data)

My issues are:

  1. The user enters id either as a single value or as a range. For example, if the user enters 1, the file name is 001.csv; if the user enters the range 1:10, the file names are 001.csv ... 010.csv.
  2. The column is entered by the user, i.e. "sulfate" or "nitrate", whichever he/she wants the mean of. There are a lot of missing values in these columns, which I need to omit before calculating the mean.
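
One possible way to handle both points, offered as a sketch rather than a definitive answer: `sprintf("%03d.csv", id)` zero-pads each ID to three digits, and `mean(..., na.rm = TRUE)` skips missing values. The helper name `monitor_files` and the small `example` data frame are hypothetical and exist only for illustration.

```r
# Hypothetical helper: build file names such as "001.csv" from numeric IDs.
monitor_files <- function(directory, id) {
  file.path(directory, sprintf("%03d.csv", id))
}

monitor_files("specdata", c(1, 10))
# "specdata/001.csv" "specdata/010.csv"

# Made-up data standing in for the contents of one monitor's file.
example <- data.frame(sulfate = c(1.5, NA, 2.5), nitrate = c(NA, 0.4, 0.6))
pollutant <- "sulfate"

# Select the column by its name (a string) and ignore NA values in the mean.
mean(example[[pollutant]], na.rm = TRUE)
# 2
```

Because the column is selected with a character string, the same line works for "sulfate" and "nitrate" without any branching.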

The combined data from all the files looks like this:

summary(DATA)
         Date           sulfate          nitrate             ID       
 2004-01-01:   250   Min.   : 0.0     Min.   : 0.0     Min.   :  1.0  
 2004-01-02:   250   1st Qu.: 1.3     1st Qu.: 0.4     1st Qu.: 79.0  
 2004-01-03:   250   Median : 2.4     Median : 0.8     Median :168.0  
 2004-01-04:   250   Mean   : 3.2     Mean   : 1.7     Mean   :164.5  
 2004-01-05:   250   3rd Qu.: 4.0     3rd Qu.: 2.0     3rd Qu.:247.0  
 2004-01-06:   250   Max.   :35.9     Max.   :53.9     Max.   :332.0  
 (Other)   :770587   NA's   :653304   NA's   :657738

Any ideas on how to approach this would be highly appreciated.

Cheers

Solution

This is how I fixed it:

pollutantmean <- function(directory, pollutant, id = 1:332) {
    #set the path
    path = directory

    #get the file List in that directory
    fileList = list.files(path, pattern = "\\.csv$")

    #extract the file names and store as numeric for comparison
    file.names = as.numeric(sub("\\.csv$","",fileList))

    #select files to be imported based on the user input or default
    selected.files = fileList[match(id,file.names)]

    #import data
    Data = lapply(file.path(path,selected.files),read.csv)

    #convert into data frame
    Data = do.call(rbind.data.frame,Data)

    #calculate mean
    mean(Data[,pollutant],na.rm=TRUE)

    }
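
To illustrate the `match()` step in isolation, here is a small sketch with a made-up file listing. It also shows one case the function above does not guard against: an id with no corresponding file yields `NA`, which would make `read.csv` fail on a path ending in "NA". Dropping the `NA` positions is one possible safeguard, not part of the original solution.

```r
# Made-up file listing, as list.files() might return it.
fileList <- c("001.csv", "002.csv", "010.csv")
file.names <- as.numeric(sub("\\.csv$", "", fileList))  # 1 2 10

# match() gives the position of each requested id, or NA when no file exists.
idx <- match(c(2, 10, 5), file.names)
idx
# 2 3 NA

# Dropping the NA positions keeps only files that are actually present.
selected.files <- fileList[idx[!is.na(idx)]]
selected.files
# "002.csv" "010.csv"
```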

My last question: the function should take "specdata" (the name of the directory where all the CSV files are located) as the directory argument. Is there a directory type object in R?

Suppose I call the function as:

pollutantmean(specdata, "nitrate", 1:10)

It should resolve the path of the specdata directory, which is inside my working directory. How can I do that?
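
As far as I know, base R has no dedicated directory type; a directory is passed as a character string, and a relative path such as "specdata" is resolved against the working directory. The sketch below shows that, plus one possible way to accept a bare name without quotes. The wrapper name `dir_name` is hypothetical, and the bare-name trick is an assumption about what is wanted here, not something from the original solution.

```r
# A directory is just a character string in R; building the full path
# from the working directory makes the resolution explicit.
full_path <- file.path(getwd(), "specdata")

# Hypothetical wrapper that also accepts a bare name (specdata without
# quotes) by capturing the unevaluated argument and turning it into text.
dir_name <- function(directory) {
  arg <- substitute(directory)
  if (is.character(arg)) arg else deparse(arg)
}

dir_name(specdata)    # "specdata"
dir_name("specdata")  # "specdata"
```

One caveat with this design: because the argument is never evaluated, passing a variable that holds a path returns the variable's name rather than its value, so the plain quoted form `pollutantmean("specdata", "nitrate", 1:10)` is usually the safer choice.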
