How to loop through a folder of CSV files in R

Question

I have a folder containing a bunch of CSV files that are titled "yob1980", "yob1981", "yob1982" etc.

I have to use a for loop to go through each file and put its contents into a data frame - the columns in the data frame should be "1980", "1981", "1982" etc

This is what I have so far:

file_list <- list.files()

temp = list.files(pattern="*.txt")
babynames <- do.call(rbind,lapply(temp,read.csv, FALSE))

names(babynames) <- c("Name", "Gender", "Count")

I feel like I need a for loop, but I'm not sure how to loop through the files. Can anyone point me in the right direction?

Recommended answer

My favourite way to do this is using ldply from the plyr package. It has the advantage of returning a data frame, so you don't need to do the rbind step afterwards:

library( plyr )
# pattern is a regular expression, not a glob: "\\.txt$" matches names ending in ".txt"
babynames <- ldply( .data = list.files(pattern="\\.txt$"),
                    .fun = read.csv,
                    header = FALSE,
                    col.names=c("Name", "Gender", "Count") )
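Since the question asks specifically for a for loop, here is a minimal base R sketch of the same import without plyr. The temporary folder, the two demo files, and the names `demo_dir`, `pieces` and `babynames_loop` are assumptions made up for illustration; only the file naming scheme and the column names come from the question.

```r
# Hypothetical demo data: two small files in a fresh temporary folder
demo_dir <- tempfile("yob_demo_")
dir.create(demo_dir)
writeLines(c("Mary,F,100", "John,M,90"), file.path(demo_dir, "yob1980.txt"))
writeLines(c("Anna,F,80", "Paul,M,70"), file.path(demo_dir, "yob1981.txt"))

# "\\.txt$" is a regular expression matching file names that end in ".txt"
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Read each file into one list element, then bind all pieces once at the end
pieces <- vector("list", length(files))
for (i in seq_along(files)) {
    pieces[[i]] <- read.csv(files[i],
                            header = FALSE,
                            col.names = c("Name", "Gender", "Count"))
}
babynames_loop <- do.call(rbind, pieces)
```

Collecting the pieces in a list and calling rbind once avoids growing the data frame on every iteration, which gets slow when there are many files.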

As an added benefit, you can run the import in parallel very easily, which can make importing large multi-file datasets quite a bit faster (doMC relies on forked processes, so this is not expected to work on Windows):

library( plyr )
library( doMC )
registerDoMC( cores = 4 )
babynames <- ldply( .data = list.files(pattern="\\.txt$"),
                    .fun = read.csv,
                    header = FALSE,
                    col.names=c("Name", "Gender", "Count"),
                    .parallel = TRUE )

To include a Year column in the resulting data frame, change the above slightly: create a function first, then pass that function to ldply in the same way you would pass read.csv:

readFun <- function( filename ) {

    # read in the data
    data <- read.csv( filename, 
                      header = FALSE, 
                      col.names = c( "Name", "Gender", "Count" ) )

    # add a "Year" column by removing both "yob" and ".txt" from file name
    # (the dot is escaped so that it matches a literal "." only)
    data$Year <- gsub( "yob|\\.txt", "", filename )

    return( data )
}

# execute that function across all files, outputting a data frame
doMC::registerDoMC( cores = 4 )
babynames <- plyr::ldply( .data = list.files(pattern="\\.txt$"),
                          .fun = readFun,
                          .parallel = TRUE )

This will give you your data in a concise and tidy (long) format, which is how I'd recommend moving forward from here. While it is possible to then separate each year's data into its own column, that is likely not the best way to go.
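If one column per year is still required, as the question describes, one possible sketch uses base R's reshape on the long data. The small `babynames_demo` data frame below is hypothetical stand-in data with the same shape that readFun produces; it is not the real dataset.

```r
# Hypothetical long-format data of the shape readFun() would return
babynames_demo <- data.frame(
    Name   = c("Mary", "John", "Mary", "John"),
    Gender = c("F", "M", "F", "M"),
    Count  = c(100L, 90L, 80L, 70L),
    Year   = c("1980", "1980", "1981", "1981"),
    stringsAsFactors = FALSE
)

# One row per Name/Gender pair, one Count column per year
wide <- reshape(babynames_demo,
                idvar     = c("Name", "Gender"),
                timevar   = "Year",
                v.names   = "Count",
                direction = "wide")

# reshape() names the new columns "Count.1980", "Count.1981", ...;
# drop the prefix so the columns are simply "1980", "1981", ...
names(wide) <- sub("^Count\\.", "", names(wide))
```

Names that do not occur in a given year end up as NA in that year's column, which is one reason the long format is usually easier to work with.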

Note: depending on your preference, it may be a good idea to convert the Year column to, say, integer class. But that's up to you.
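A minimal sketch of that conversion, using a small hypothetical data frame (`years_demo`) in place of the imported data. gsub returns character strings, so Year starts out as a character column:

```r
# Hypothetical stand-in for the imported data; Year is character after gsub()
years_demo <- data.frame(Name = c("Mary", "John"),
                         Year = c("1980", "1981"),
                         stringsAsFactors = FALSE)

# Convert the character years to integers so they sort and compare numerically
years_demo$Year <- as.integer(years_demo$Year)
```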