读取多个Excel文件,添加一列,然后进行绑定 [英] Reading in multiple excel files, adding a column, then binding

查看:78
本文介绍了读取多个Excel文件,添加一列,然后进行绑定的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一系列要读入R的Excel文件,基于文件名添加日期列,然后绑定在一起.

I have a series of Excel files that I want to read into R, add in a date column based on the file name, then bind together.

文件的命名约定为User_Info_Jan,User_Info_Feb,User_Info_Mar.仅在文件名中引用月份,而在实际文件本身中并未实际提及.User_Info_Jan文件的示例如下:

The naming convention of the files is User_Info_Jan, User_Info_Feb, User_Info_Mar. The month is only referenced in the name of the file and not actually mentioned in the actual file itself. An example of what the User_Info_Jan files looks like:

ID   Name
ABC  Joe Smith
DEF  Henry Cooper 
ZCS  Kelly Ma

有没有一种方法可以使用文件名中的模式(模式= User_Info_)读取文件,然后添加一个名为"Month"的列,指示文件绑定到哪个月之前?

Is there a way I can read the files in using the pattern in the file name (pattern = User_Info_), then add a column called "Month" indicating what month the file is for, before binding together?

月份"列之后的示例数据"框:

Sample Data frame after month column:

ID   Name           Month
ABC  Joe Smith      January
DEF  Henry Cooper   January
ZCS  Kelly Ma       January

绑定在一起后的示例数据帧:

Sample data frame after binding together:

ID   Name           Usage Month
ABC  Joe Smith      January
DEF  Henry Cooper   January
ZCS  Kelly Ma       January
KFY  Lisa Schwartz  February
LFG  Alex Shah      March

推荐答案

我将使用伪造的文件名进行演示,但是我建议您运行的实际命令以相同的结构被注释掉.我假设 .xlsx 用于"excel文件",但这与 .csv 一样好(只需更新模式).

I'll demonstrate with fake filenames, but the real commands I suggest you run are commented out with the same structure. I'm assuming .xlsx for "excel files", but this works equally well with .csv (just update the pattern).

# files <- list.files(path = ".", pattern = "User_Info_.*\\.xlsx$", full.names = TRUE)
files <- c("./User_Info_Jan.xlsx", "./User_Info_Feb.xlsx", "./User_Info_Mar.xlsx")
monthnames <- strcapture("User_Info_(.*)\\.xlsx", files, list(month = ""))
monthnames
#   month
# 1   Jan
# 2   Feb
# 3   Mar

至此,我们已经从每个文件名中提取月份名称.我发现 strcapture (在R中)比 gsub 好,因为如果没有匹配项,后者会返回整个字符串.基本R中的另一个替代方法是 regmatches(files,gregexpr(...)),但这似乎比这里需要的要复杂一些.另一个替代方法是 stringr :: str_extract ,如果您已经在使用 stringr 和/或其他tidyverse程序包,则可能更直观.

At this point, we've extracted the month name from each filename. I find strcapture (in base R) better than gsub, as the latter returns the entire string if there are no matches; another alternative in base R is regmatches(files, gregexpr(...)), but that seems a bit more complicated than it needs to be here. Another alternative is stringr::str_extract which might be more intuitive if you're already using stringr and/or other tidyverse packages.

从这里开始,我们可以遍历文件以读取它们.

From here, we can iterate over the files to read them in.

# out <- Map(function(mn, fn) transform(readxl::read_excel(fn), month = mn), monthnames$month, files)
set.seed(42)
out <- Map(function(mn, fn) transform(mtcars[sample(32,size=2),], month = mn), monthnames$month, files)
out
# $Jan
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb month
# Chrysler Imperial 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4   Jan
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2   Jan
# $Feb
#                   mpg cyl disp  hp drat    wt  qsec vs am gear carb month
# Mazda RX4        21.0   6  160 110 3.90 2.620 16.46  0  1    4    4   Feb
# Pontiac Firebird 19.2   8  400 175 3.08 3.845 17.05  0  0    3    2   Feb
# $Mar
#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb month
# Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   Mar
# Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   Mar

将这些帧列表合并为一个帧是直接的:

Combining that list-of-frames into a single frame is direct:

do.call(rbind, out)
#                        mpg cyl  disp  hp drat    wt  qsec vs am gear carb month
# Jan.Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4   Jan
# Jan.Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   Jan
# Feb.Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4   Feb
# Feb.Pontiac Firebird  19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2   Feb
# Mar.Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   Mar
# Mar.Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   Mar

所有方法的替代方法可以使用 data.table :: rbindlist dplyr :: bind_rows 并分配"id";直接列:

An alternative to all of that can use data.table::rbindlist or dplyr::bind_rows, and assign the "id" column directly:

# out <- Map(function(mn, fn) readxl::read_excel(fn), monthnames$month, files)
set.seed(42)
out <- Map(function(mn, fn) mtcars[sample(32,size=2),], monthnames$month, files)

data.table::rbindlist(out, idcol = "month")
#     month   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <char> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:    Jan  14.7     8 440.0   230  3.23 5.345 17.42     0     0     3     4
# 2:    Jan  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
# 3:    Feb  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
# 4:    Feb  19.2     8 400.0   175  3.08 3.845 17.05     0     0     3     2
# 5:    Mar  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# 6:    Mar  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1

dplyr::bind_rows(out, .id = "month")
#                   month  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Chrysler Imperial   Jan 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
# Hornet Sportabout   Jan 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
# Mazda RX4           Feb 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
# Pontiac Firebird    Feb 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
# Merc 280            Mar 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
# Hornet 4 Drive      Mar 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1

后两个起作用是因为当我较早调用 Map 时,传递给内部函数的第一个参数( monthnames $ month )被用作的名称.> list 输出,这就是为什么您将 $ Jan 等视为返回列表的元素的原因. rbindlist bind_rows 都使用这些名称作为"id",使用 idcol = / .id = 时的列.(如果实际上没有名称",则两个函数都沿它们计数.)

The latter two work because when I called Map earlier, the first argument (monthnames$month) passed to the inner function is used as the names for the list output, which is why you see $Jan etc as the elements of the returned list. Both rbindlist and bind_rows use those names as "id" columns when idcol=/.id= are used. (If no "names" are actually present, both functions count along them.)

这篇关于读取多个Excel文件,添加一列,然后进行绑定的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆