How to write a for loop to read multiple csv files into R and subset the data to make clean dataframes for ggplots?

Problem description

I am trying to read multiple CSV files into R and then subset them by removing columns I don't need using the 'subset' function. I am trying to set up a for loop in R so that I can apply functions or calculations to a list of CSVs in order to produce data frames for ggplots or statistical analysis later. (I currently have tidyverse, dplyr, and ggplot2 installed.) Right now I just want to subset the CSVs and then create a data frame from the subsetted data.

I used a for loop to successfully read multiple csvs into separate dataframes by setting a working directory, creating a list of csvs, then reading them into dataframes. This currently outputs a dataframe for each csv named after the original filename:

filenames <- gsub("\\.csv$", "", list.files(pattern = "\\.csv$"))

for (i in filenames) {
  assign(i, read.csv(paste(i, ".csv", sep = "")))
}

Then I realized I wanted to subset these data before putting them into the dataframes in order to avoid some repetitive code later; however, I am getting an error each time I tried to add a subset function to the for loop. This is what I currently have:

for(i in filenames){
  read.csv(i)
  subset(i, select = c("names", "of columns", "I want"))
  assign(i, read.csv(paste(i, ".csv", sep="")))
}

I receive a "no such file or directory" error. I'm sure I'm missing something obvious since my R foundation is poor, but any help or advice to make this work would be appreciated. The subset function has worked for me in the past, but I had to write out a new line for each dataframe and would like to avoid that by using a list and a for loop or some other method.

Thank you

Solution

Apparently, all CSV files have the same structure, i.e., the same number and names of columns. Therefore, the suggestion by MrFlick and OP's own answer can be improved in several ways:

  1. The read.csv() function reads all columns, so a separate subsetting step is required to keep only the wanted columns. The fread() function from the data.table package has a select parameter that reads only the wanted columns from the file.
  2. rbindlist() does the same job as do.call(rbind, ...) but has an additional parameter idcol, which creates an extra column identifying the origin of each row.
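
For comparison, the loop from the question can also be repaired in base R. The original loop failed for two reasons: read.csv(i) was called with a file name whose .csv extension had already been stripped, and subset() was applied to the name i rather than to the data that was read. The following is only a sketch under assumptions: it writes two small sample files into a temporary directory, and the names wanted, frames, and combined are made up for this illustration.

```r
# Assumption: two small sample files in a temporary directory stand in for the real CSVs.
dir <- tempfile("csv_demo")
dir.create(dir)
writeLines(c("names.of,columns,I.want,useless.column", "1,2,3,4"),
           file.path(dir, "test1.csv"))
writeLines(c("names.of,columns,I.want,useless.column,another.useless.column",
             "21,22,23,24,25", "31,32,33,34,35"),
           file.path(dir, "test2.csv"))

wanted <- c("names.of", "columns", "I.want")   # hypothetical column selection
files  <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

frames <- list()
for (f in files) {
  dat <- read.csv(f)                     # read the file, extension included
  dat <- subset(dat, select = wanted)    # subset the data, not the file name
  frames[[basename(f)]] <- dat           # collect in a named list instead of assign()
}

# Stack all subsets into one data frame (the base R counterpart of rbindlist())
combined <- do.call(rbind, frames)
```

Collecting the results in a named list instead of calling assign() keeps the data frames together, so later steps can loop over them or bind them in one call.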

Create list of data frames

lapply(list.files(pattern = "\\.csv$"), data.table::fread, 
       select = c("names.of", "columns", "I.want"))

[[1]]
   names.of columns I.want
1:        1       2      3

[[2]]
   names.of columns I.want
1:       21      22     23
2:       31      32     33

Note that only selected columns are read from files.
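
The effect of select can be checked on a single file. This is a minimal sketch that assumes a throwaway file with the same layout as test1.csv; the variable names path, all_cols, and some_cols are invented for the illustration.

```r
library(data.table)

# Assumption: a temporary file with the same layout as test1.csv
path <- tempfile(fileext = ".csv")
writeLines(c("names.of,columns,I.want,useless.column", "1,2,3,4"), path)

all_cols  <- fread(path)                                             # every column
some_cols <- fread(path, select = c("names.of", "columns", "I.want")) # wanted columns only

names(all_cols)    # includes useless.column
names(some_cols)   # useless.column was never read
```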

Create one large dataframe

library(data.table)
library(magrittr)   # piping used here to improve readability
lapply(list.files(pattern = "\\.csv$"), fread, select = c("names.of", "columns", "I.want")) %>% 
  rbindlist(idcol = TRUE)

   .id names.of columns I.want
1:   1        1       2      3
2:   2       21      22     23
3:   2       31      32     33

Note that the .id column gives the sequence number of list elements.

Create one large dataframe with originating file names

library(data.table)
library(magrittr)
filenames = list.files(pattern = "\\.csv$")
lapply(filenames, fread, select = c("names.of", "columns", "I.want")) %>% 
  set_names(filenames) %>% 
  rbindlist(idcol = "origin")

      origin names.of columns I.want
1: test1.csv        1       2      3
2: test2.csv       21      22     23
3: test2.csv       31      32     33

Here, set_names() from the magrittr package is used to name the list elements. Then, rbindlist() uses the names of the list elements for the id column.
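
If magrittr is not wanted, the list can be named with base R before calling rbindlist(). The sketch below is one possible variant, not the method from the answer: it assumes sample files in a temporary directory and additionally strips the directory and the .csv extension from the names with basename() and tools::file_path_sans_ext(). The names tables and combined are made up here.

```r
library(data.table)

# Assumption: two sample files in a temporary directory
dir <- tempfile("csv_demo")
dir.create(dir)
writeLines(c("names.of,columns,I.want,useless.column", "1,2,3,4"),
           file.path(dir, "test1.csv"))
writeLines(c("names.of,columns,I.want,useless.column,another.useless.column",
             "21,22,23,24,25", "31,32,33,34,35"),
           file.path(dir, "test2.csv"))

filenames <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(filenames, fread, select = c("names.of", "columns", "I.want"))

# Name the list elements after the files, without directory or extension
names(tables) <- tools::file_path_sans_ext(basename(filenames))

# rbindlist() copies the list names into the id column
combined <- rbindlist(tables, idcol = "origin")
```

An origin column of this kind can then serve as a grouping variable, for example for colours or facets in a later ggplot.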

Sample data

I have created two files.

test1.csv contains one row and four columns:

"names.of", "columns", "I.want", "useless.column"
1, 2, 3, 4

test2.csv contains two rows and five columns:

"names.of", "columns", "I.want", "useless.column", "another.useless.column"
21, 22, 23, 24, 25
31, 32, 33, 34, 35

Note that I have modified the column names to ensure that they are syntactically valid variable names.
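
To reproduce the examples, the two sample files can be recreated from R. This sketch assumes a temporary directory instead of the working directory and omits the spaces after the commas; the make.names() call at the end illustrates how R would convert the column names from the question into syntactically valid ones.

```r
# Assumption: files go to a temporary directory rather than the working directory
dir <- tempfile("csv_demo")
dir.create(dir)

writeLines(c('"names.of","columns","I.want","useless.column"',
             "1,2,3,4"),
           file.path(dir, "test1.csv"))
writeLines(c('"names.of","columns","I.want","useless.column","another.useless.column"',
             "21,22,23,24,25",
             "31,32,33,34,35"),
           file.path(dir, "test2.csv"))

# The same pattern as in the answer finds both files
found <- list.files(dir, pattern = "\\.csv$")

# Names with spaces are not syntactically valid; make.names() replaces the spaces with dots
valid <- make.names(c("names", "of columns", "I want"))
```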
