如何编写R来循环集合目录中每个文件的每个工作表 [英] how to write R to loop each worksheet of every files in the set directory

查看:428
本文介绍了如何编写R来循环集合目录中每个文件的每个工作表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在这里有一个脚本,它工作正常,以获得一个数字在某一列。现在我想收集不仅在目录中每个文件的第一张,而是每个文件的每一张。



现在.csv文件R已经写了2列,列A是文件名,B是抓取的数字R.



我应该在下面的脚本中添加什么修改以使csv输出显示3列,A是文件名,B是表名,C是数字?

  require(xlsx)
#setwd
setwd(D:\\Transferred Files \\ )
files< - (Sys.glob(*。xls))
f <-length(files)

DF < - data.frame = $($)







=
A< -read.xlsx(file = files [i],1,startColumn = 1,endColumn = 20,startRow = 1,endRow = 60)
# as.data.frame.matrix(A)
P <-B其中(apply(B,1,function(x)any(grepl(P,x))) ,2,function(x)any(grepl(P,

x)))+ 6]

#fill price DF
DF [ ] <-c(files [i],P)
}
write.csv(DF,prices.csv,row.names = FALSE)



我试过XLconnet,但不能真正使它工作。

解决方案

你有一个好的开始,但你要问如何添加到文件中的工作表的循环。如果你读取?read.xlsx ,你会看到两个参数,你在代码中使用一个,忽略另一个:

 用法:

read.xlsx(file,sheetIndex,sheetName = NULL,rowIndex = NULL,
startRow = NULL,endRow = NULL,colIndex = NULL,
as.data.frame = TRUE,header = TRUE,colClasses = NA,
keepFormulas = FALSE,encoding =unknown,...)

参数:

file:要读取的文件的路径。

sheetIndex:表示工作簿中工作表索引的数字。

sheetName:带有工作表名称的字符串。

您只需要提供其中一个。



您可能会问:我如何知道工作表中有多少页?( sheetIndex )甚至是什么工作表名称? sheetName )。 ?getSheets 到救援:

 用法:

getSheets(wb)

参数:

wb:由'createWorksheet'或
'loadWorksheet'返回的工作簿对象。

值:

'getSheets'返回一个java对象引用的列表,每个对象引用
到工作表。列表使用工作表名称命名。

您需要使用 loadWorkbook(file)而不是 read.xlsx 以获取工作表名称,但是对手册的一点阅读将为您提供需要切换的信息。 (你可以使用 getSheets(loadWorkbook(file)),但在我的经验中,我尽量避免在同一个脚本多次打开同一个文件,



或者,Hadley的 readxl 包在其简单性,速度和稳定性方面表现出了希望。它有 excel_sheets() read_excel(),应该满足您的需要。 (事实上​​,这是所有它有...简单是一个好东西(tm)。)



编辑::

 库(XLConnect)
##正在加载所需软件包:XLConnectJars
## XLConnect 0.2-11 by Mirai Solutions GmbH [aut],
## Martin Studer [cre],
## Apache软件基金会[ctb,cph](Apache POI,Apache Commons
## Codec),
## Stephen Colebourne [ctb,cph](Joda-Time Java库)
## http://www.mirai-solutions.com,
## http://miraisolutions.wordpress.com
##附加包:'XLConnect'
##以下对象从'package:xlsx'中被屏蔽:
## createFreezePane ,createSheet,createSplitPane,getCellStyle,getSheets,loadWorkbook,removeSheet,saveWorkbook,setCellStyle,setColumnWidth,setRowHeight

wb1< loadWorkbook('Book1.xlsx')
shts1< getSheets
shts1
## [1]OrigSheet2Sheet8Sheet3Sheet4Sheet5Sheet6Sheet7
for(ws in shts1) {
message(ws)#只是宣布自己
dat< - readWorksheet(wb1,ws)
消息(粘贴(dim(dat),collapse ='x'))有意义,不是这个
}
##原价
## 128 x 11
##板2
## 128 x 11
##板8
## 128 x 19
## Sheet3
## 17 x 11
## Sheet4
## 128 x 11
## Sheet5
## 128 x 11
## Sheet6
## 128 x 11
## Sheet7
## 128 x 11

编辑#2



一个更详细的迭代例子:

  library(XLConnect)
for(fn in list.files(pattern =* .xlsx)){
message('Opening:',fn)
wb< - loadWorkbook(fn)
shts< - getSheets(wb)
message(sprintf ('%d Sheets:%s',length(shts),
paste(shts,collapse =',')))
for(sh in shts){
dat < readWorksheet(wb,sh)
##对数据做一些有意义的事情
}
}

我不太确定你在使用你的代码(因为你从来没有说过什么是包含在任何电子表格),但一个替代方法(我将用来代替以前的双 - for 示例)是包含列表中的所有内容:

  dat< ;  -  sapply(list.files(pattern ='*。xlsx'),function(fn){
wb< - loadWorkbook(fn)
sapply(getSheets(wb),function (wb,sh))
})

str(dat,list.len = 2)
## 4的列表
## $ Book1.xlsx: 8
## .. $ Orig:'data.frame':128 obs。的11个变量:
## .. .. $ i:num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
## .. .. $ x:num [1:128] 1606527 7484 437881 1601729 1341668 ...
## .. .. [list output truncated]
## .. $ Sheet2:'data.frame':128 obs。的11个变量:
## .. .. $ i:num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
## .. .. $ x:num [1:128] 1606527 7484 437881 1601729 1341668 ...
## .. .. [列表输出截断]
## .. [列表输出截断]
## $ Book2。 xlsx:8
的列表## .. $ Orig:'data.frame':128 obs。的11个变量:
## .. .. $ i:num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
## .. .. $ x:num [1:128] 1606527 7484 437881 1601729 1341668 ...
## .. .. [list output truncated]
## .. $ Sheet2:'data.frame':128 obs。 11个变量:
## .. .. $ i:num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
## .. .. $ x:num [1:128] 1606527 7484 437881 1601729 1341668 ...
## .. .. [list output truncated]
## .. [list output truncated]
## [list output truncated]

如果您不关心从哪个工作簿区分特定工作表,随后简化处理数据 - 然后您可以嵌套嵌套列表到单个列表:

  flatdat < (dat,recur = FALSE)
str(flatdat,list.len = 3)
## 555的列表
## $ Book1.xlsx.Orig:'data.frame':128 obs。的11个变量:
## .. $ i:num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
## .. $ x:num [1:128 ] 1606527 7484 437881 1601729 1341668 ...
## .. $ c1:num [1:128] 1 1 1 1 1 1 1 1 1 1 ...
## .. [列表输出truncated]
## $ Book1.xlsx.Sheet2:'data.frame':128 obs。的11个变量:
## .. $ i:num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
## .. $ x:num [1:128 ] 1606527 7484 437881 1601729 1341668 ...
## .. $ c1:num [1:128] 1 1 1 1 1 1 1 1 1 1 ...
## .. [列表输出truncated]
## $ Book1.xlsx.Sheet8:'data.frame':128 obs。的19个变量:
## .. $ i:num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
## .. $ x:num [1:128 ] 1606527 7484 437881 1601729 1341668 ...
## .. $ c1:num [1:128] 1 1 1 1 1 1 1 1 1 1 ...
## .. [列表输出truncated]
## [list output truncated]

现在,处理数据可能更简单。你的代码寻找P是有点有缺陷,你正在分配一个data.frame到另一个data.frame内的单元格,通常皱眉。



这可能会转向另一个问题给你。为此,我强烈敦促您提供更详细的问题,包括示例工作表的样子和你期望你的输出看起来像。


I have a scripte here and it works fine to get a number in a certain column. Now i want to collect not only on the 1st sheet of every files in the directory, but each sheets of each files.

now the .csv file R has written shows 2 column, column A is filename, and B is the number R grabbed.

What modification should I add to the below script to have a csv output that shows 3 column, A is filename, B is sheetnames, C is the number?

require(xlsx)
#setwd
setwd("D:\\Transferred Files\\")
files <- (Sys.glob("*.xls"))
f<-length(files)

DF <- data.frame(txt=rep("", f),num=rep(NA, f),stringsAsFactors=FALSE)

# files loop
for(i in 1:f)
{
  A<-read.xlsx(file=files[i],1,startColumn=1, endColumn=20, startRow=1, endRow=60)
  #Find price
  B<-as.data.frame.matrix(A)
  P<-B[which(apply(B, 1, function(x) any(grepl("P", x)))),which(apply(B, 2, function(x) any(grepl("P", 

x))))+6]

  #fill price DF
  DF[i, ] <-c(files[i],P)
}
write.csv(DF, "prices.csv", row.names=FALSE)

I tried XLconnet, but can't really make it work into this.

解决方案

You have a good start, but you are asking how to add into the loop the worksheets within a file. If you read ?read.xlsx, you'll see two arguments that you are glossing over (well, using one, ignoring the other) in your code:

Usage:

     read.xlsx(file, sheetIndex, sheetName=NULL, rowIndex=NULL,
       startRow=NULL, endRow=NULL, colIndex=NULL,
       as.data.frame=TRUE, header=TRUE, colClasses=NA,
       keepFormulas=FALSE, encoding="unknown", ...)

Arguments:

    file: the path to the file to read.

sheetIndex: a number representing the sheet index in the workbook.

sheetName: a character string with the sheet name.

You should only need to provide one of the two.

You might ask "how do I know how many sheets there are in a worksheet?" (for sheetIndex) or even "what are the sheet names?" (for sheetName). ?getSheets to the rescue:

Usage:

     getSheets(wb)

Arguments:

      wb: a workbook object as returned by 'createWorksheet' or
          'loadWorksheet'.

Value:

     'getSheets' returns a list of java object references each pointing
     to an worksheet.  The list is named with the sheet names.

You'll need to use loadWorkbook(file) instead of read.xlsx in order to get the sheet names, but a little reading of the manuals will provide you the information you need to switch over. (You can use something like getSheets(loadWorkbook(file)), but in my experience I try to avoid opening the same file multiple times in the same script, regardless of auto-closing.)

As an alternative, Hadley's readxl package shows promise in its simplicity, speed, and stability. It has excel_sheets() and read_excel() that should fill your needs. (In fact, that's all it has ... simplicity is "A Good Thing (tm)".)

Edit:

library(XLConnect)
## Loading required package: XLConnectJars
## XLConnect 0.2-11 by Mirai Solutions GmbH [aut],
##   Martin Studer [cre],
##   The Apache Software Foundation [ctb, cph] (Apache POI, Apache Commons
##     Codec),
##   Stephen Colebourne [ctb, cph] (Joda-Time Java library)
## http://www.mirai-solutions.com ,
## http://miraisolutions.wordpress.com
## Attaching package: 'XLConnect'
## The following objects are masked from 'package:xlsx':
##     createFreezePane, createSheet, createSplitPane, getCellStyle, getSheets, loadWorkbook, removeSheet, saveWorkbook, setCellStyle, setColumnWidth, setRowHeight

wb1 <- loadWorkbook('Book1.xlsx')
shts1 <- getSheets(wb1)
shts1
## [1] "Orig"   "Sheet2" "Sheet8" "Sheet3" "Sheet4" "Sheet5" "Sheet6" "Sheet7"
for (ws in shts1) {
    message(ws)                              # just announcing myself
    dat <- readWorksheet(wb1, ws)
    message(paste(dim(dat), collapse=' x ')) # do something meaningful, not this
}
## Orig
## 128 x 11
## Sheet2
## 128 x 11
## Sheet8
## 128 x 19
## Sheet3
## 17 x 11
## Sheet4
## 128 x 11
## Sheet5
## 128 x 11
## Sheet6
## 128 x 11
## Sheet7
## 128 x 11

Edit #2:

As a more detailed iterative example:

library(XLConnect)
for (fn in list.files(pattern="*.xlsx")) {
    message('Opening: ', fn)
    wb <- loadWorkbook(fn)
    shts <- getSheets(wb)
    message(sprintf('    %d Sheets: %s', length(shts),
                    paste(shts, collapse=', ')))
    for (sh in shts) {
        dat <- readWorksheet(wb, sh)
        ## do something meaningful with the data
    }        
}

I'm not really sure what you are doing with your code (since you never said what was contained within any of the spreadsheets), but an alternative approach (that I would use in place of the previous double-for example) is to enclose everything within lists:

dat <- sapply(list.files(pattern='*.xlsx'), function(fn) {
    wb <- loadWorkbook(fn)
    sapply(getSheets(wb), function(sh) readWorksheet(wb, sh))
})

str(dat, list.len=2)
## List of 4
##  $ Book1.xlsx:List of 8
##   ..$ Orig  :'data.frame':   128 obs. of  11 variables:
##   .. ..$ i   : num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ x   : num [1:128] 1606527 7484 437881 1601729 1341668 ...
##   .. .. [list output truncated]
##   ..$ Sheet2:'data.frame':   128 obs. of  11 variables:
##   .. ..$ i   : num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ x   : num [1:128] 1606527 7484 437881 1601729 1341668 ...
##   .. .. [list output truncated]
##   .. [list output truncated]
##  $ Book2.xlsx:List of 8
##   ..$ Orig  :'data.frame':   128 obs. of  11 variables:
##   .. ..$ i   : num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ x   : num [1:128] 1606527 7484 437881 1601729 1341668 ...
##   .. .. [list output truncated]
##   ..$ Sheet2:'data.frame':   128 obs. of  11 variables:
##   .. ..$ i   : num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ x   : num [1:128] 1606527 7484 437881 1601729 1341668 ...
##   .. .. [list output truncated]
##   .. [list output truncated]
##   [list output truncated]

If you don't care about differentiating from which workbook a particular sheet came from -- and subsequently simplify processing the data -- then you can "flatten" the nested lists into a single list:

flatdat <- unlist(dat, recur=FALSE)
str(flatdat, list.len=3)
## List of 555
##  $ Book1.xlsx.Orig  :'data.frame':   128 obs. of  11 variables:
##   ..$ i   : num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ x   : num [1:128] 1606527 7484 437881 1601729 1341668 ...
##   ..$ c1  : num [1:128] 1 1 1 1 1 1 1 1 1 1 ...
##   .. [list output truncated]
##  $ Book1.xlsx.Sheet2:'data.frame':   128 obs. of  11 variables:
##   ..$ i   : num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ x   : num [1:128] 1606527 7484 437881 1601729 1341668 ...
##   ..$ c1  : num [1:128] 1 1 1 1 1 1 1 1 1 1 ...
##   .. [list output truncated]
##  $ Book1.xlsx.Sheet8:'data.frame':   128 obs. of  19 variables:
##   ..$ i    : num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ x    : num [1:128] 1606527 7484 437881 1601729 1341668 ...
##   ..$ c1   : num [1:128] 1 1 1 1 1 1 1 1 1 1 ...
##   .. [list output truncated]
##   [list output truncated]

Now, processing your data is perhaps simpler. Your code for looking for "P"s is a bit flawed in that you are assigning a data.frame to a cell within another data.frame, typically frowned upon.

This will likely turn in to another question for you. For that, I strongly urge you to provide a better detailed question, including what a sample worksheet looks like and what you expect your output to look like.

这篇关于如何编写R来循环集合目录中每个文件的每个工作表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆