如何写R来循环设置目录中每个文件的每个工作表 [英] how to write R to loop each worksheet of every files in the set directory

查看:901
本文介绍了如何写R来循环设置目录中每个文件的每个工作表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在这里有一个脚本,在一列中可以找到一个数字。现在我不仅要收集目录中的每个文件的第一张表,而且还可以收集每个文件的每张文件。



现在.csv文件R已经写了2个列,列A是文件名,B是被抓取的数字R.



在下面的脚本中添加一个csv输出,显示3列,A是文件名,B是表格名称,C是数字吗? >

  require(xlsx)
#setwd
setwd(D:\\Transferred Files\\ )
文件< - (Sys.glob(*。xls))
f <-length(文件)

DF< - data.frame(txt = rep(,f),num = rep(NA,f),stringsAsFactors = FALSE)

#文件循环
(i in 1:f)
{
A< -read.xlsx(file = files [i],1,startColumn = 1,endColumn = 20,startRow = 1,endRow = 60)
#Find价格
B < (A)
P <-B(其中(适用于(B,1,function(x)any(grepl(P,x)))), ,2,function(x)any(grepl(P,

x)))+ 6]

#fill price DF
DF [i, ]< -c(files [i],P)
}
write.csv(DF,prices.csv,row.names = FALSE)

我尝试过XLconnet,但实际上并不能使它成功。

解决案

您有一个好的开始,但你问如何在一个文件中加入到环中的工作表。如果您阅读?read.xlsx ,您会看到两个参数,您在代码中看到了两个参数(以及使用一个,忽略另一个):

 用法:

read.xlsx(file,sheetIndex,sheetName = NULL,rowIndex = NULL,
startRow = NULL,endRow = NULL,colIndex = NULL,
as.data.frame = TRUE,header = TRUE,colClasses = NA,
keepFormulas = FALSE,encoding =unknown,...)

参数:

文件:要读取的文件的路径。

sheetIndex:表示工作簿中工作表索引的数字。

sheetName:带有工作表名称的字符串。

您只需要提供两者之一。



您可能会问:我如何知道工作表中有多少张?(对于 sheetIndex )或甚至什么是表格名称?(对于 sheetName )。 ?getSheets 拯救:

 用法:

getSheets(wb)

参数:

wb:由'createWorksheet'返回的工作簿对象或
'loadWorksheet'。

值:

'getSheets'返回一个java对象引用的列表,每个对象引用都指向
到工作表。该列表以工作表名称命名。

您需要使用 loadWorkbook(file)而不是 read.xlsx ,以获取工作表名称,但是手册的一点阅读将为您提供切换所需的信息。 (您可以使用类似 getSheets(loadWorkbook(file))的东西,但根据我的经验,我尽量避免在同一个脚本中多次打开相同的文件,关闭。)



另外,Hadley的 readxl 包在其简单,速度和稳定性方面表现出希望。它有 excel_sheets() read_excel(),以满足您的需求。 (事实上​​,这是所有它有...简单是A Good Thing(tm)。)



编辑

 库(XLConnect)
##加载所需的包:XLConnectJars
## XLConnect 0.2-11 by Mirai Solutions GmbH [aut],
## Martin Studer [cre],
## Apache Software Foundation [ctb,cph](Apache POI,Apache Commons
## Codec),
## Stephen Colebourne [ctb,cph](Joda-Time Java库)
## http://www.mirai-solutions.com,
## http://miraisolutions.wordpress.com
##附件包:'XLConnect'
##以下对象从'package:xlsx'中被掩码:
## createFreezePane ,createSheet,createSplitPane,getCellStyle,getSheets,loadWorkbook,removeSheet,saveWorkbook,setCellStyle,setColumnWidth,setRowHeight

wb1< - loadWorkbook('Book1.xlsx')
shts1< - getSheets wb1)
shts1
## [1]OrigSheet2Sheet8Sheet3Sheet4Sheet5 Sheet6Sheet7
for(ws in shts1){
message(ws)#只是宣布自己
dat< - readWorksheet(wb1,ws)
message(paste dim(dat),collapse ='x'))#做一些有意义的事情,不是这个
}
## Orig
## 128 x 11
## Sheet2
## 128 x 11
## Sheet8
## 128 x 19
## Sheet3
## 17 x 11
## Sheet4
# #128 x 11
## Sheet5
## 128 x 11
## Sheet6
## 128 x 11
## Sheet7
## 128 x 11

编辑#2 / p>

作为一个更详细的迭代示例:

 库(XLConnect)
for(fn in list.files(pattern =*。xlsx)){
message('Opening:',fn)
wb< - loadWorkbook(fn)
shts< - getSheets(wb)
message(sprintf('%d Sheets:%s',length(shts),
paste(shts,collapse =',')))
for(sh in shts){
da t < - readWorksheet(wb,sh)
##对数据做一些有意义的事情
}
}

我不太确定你在使用代码的方式(因为你从来没有说过任何电子表格中包含的内容),而是一种替代方法以前的双重 - for 示例)将所有内容都包含在列表中:

  dat < -  sapply(list.files(pattern ='*。xlsx'),function(fn){
wb< - loadWorkbook(fn)
sapply(getSheets(wb) (sh)readWorksheet(wb,sh))
})

str(dat,list.len = 2)
##列表4
## $ Book1.xlsx:列表8
## .. $ Orig:'data.frame':128 obs。的11个变量:
## .. .. $ i:num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
## .. .. $ x:num [1:128] 1606527 7484 437881 1601729 1341668 ...
## .. .. [list output truncated]
## .. $ Sheet2:'data.frame':128 obs。的11个变量:
## .. .. $ i:num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
## .. .. $ x:num [1:128] 1606527 7484 437881 1601729 1341668 ...
## .. .. [list output truncated]
## .. [list output truncated]
## $ Book2。 xlsx:列表8
## .. $ Orig:'data.frame':128 obs。的11个变量:
## .. .. $ i:num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
## .. .. $ x:num [1:128] 1606527 7484 437881 1601729 1341668 ...
## .. .. [list output truncated]
## .. $ Sheet2:'data.frame':128 obs。的11个变量:
## .. .. $ i:num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
## .. .. $ x:num [1:128] 1606527 7484 437881 1601729 1341668 ...
## .. .. [列表输出截断]
## .. [列表输出截断]
## [列表输出截断]

如果您不在乎区分特定工作表来自哪个工作簿,而随后简化处理数据 - 然后您可以将嵌套列表展平为单个列表:

  flatdat<  -  unlist (dat,recur = FALSE)
str(flatdat,list.len = 3)
##列表555
## $ Book1.xlsx.Orig:'data.frame':128 OBS。的11个变量:
## .. $ i:num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
## .. $ x:num [1:128 ] 1606527 7484 437881 1601729 1341668 ...
## .. $ c1:num [1:128] 1 1 1 1 1 1 1 1 1 1 ...
## .. [列表输出truncated]
## $ Book1.xlsx.Sheet2:'data.frame':128 obs。的11个变量:
## .. $ i:num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
## .. $ x:num [1:128 ] 1606527 7484 437881 1601729 1341668 ...
## .. $ c1:num [1:128] 1 1 1 1 1 1 1 1 1 1 ...
## .. [列表输出截断]
## $ Book1.xlsx.Sheet8:'data.frame':128 obs。 19个变量:
## .. $ i:num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
## .. $ x:num [1:128 ] 1606527 7484 437881 1601729 1341668 ...
## .. $ c1:num [1:128] 1 1 1 1 1 1 1 1 1 1 ...
## .. [列表输出截断]
## [列表输出截断]

现在,处理您的数据可能更简单。你的代码寻找P是有点有缺陷的是,你将一个数据框架分配给一个单元格在另一个数据框架,通常皱起眉头。



这可能会给你带来另一个问题。为此,我强烈要求您提供更详细的问题,包括示例工作表的样子以及您期望输出的样子。


I have a scripte here and it works fine to get a number in a certain column. Now i want to collect not only on the 1st sheet of every files in the directory, but each sheets of each files.

now the .csv file R has written shows 2 column, column A is filename, and B is the number R grabbed.

What modification should I add to the below script to have a csv output that shows 3 column, A is filename, B is sheetnames, C is the number?

require(xlsx)
#setwd
setwd("D:\\Transferred Files\\")
files <- (Sys.glob("*.xls"))
f<-length(files)

DF <- data.frame(txt=rep("", f),num=rep(NA, f),stringsAsFactors=FALSE)

# files loop
for(i in 1:f)
{
  A<-read.xlsx(file=files[i],1,startColumn=1, endColumn=20, startRow=1, endRow=60)
  #Find price
  B<-as.data.frame.matrix(A)
  P<-B[which(apply(B, 1, function(x) any(grepl("P", x)))),which(apply(B, 2, function(x) any(grepl("P", 

x))))+6]

  #fill price DF
  DF[i, ] <-c(files[i],P)
}
write.csv(DF, "prices.csv", row.names=FALSE)

I tried XLconnet, but can't really make it work into this.

解决方案

You have a good start, but you are asking how to add into the loop the worksheets within a file. If you read ?read.xlsx, you'll see two arguments that you are glossing over (well, using one, ignoring the other) in your code:

Usage:

     read.xlsx(file, sheetIndex, sheetName=NULL, rowIndex=NULL,
       startRow=NULL, endRow=NULL, colIndex=NULL,
       as.data.frame=TRUE, header=TRUE, colClasses=NA,
       keepFormulas=FALSE, encoding="unknown", ...)

Arguments:

    file: the path to the file to read.

sheetIndex: a number representing the sheet index in the workbook.

sheetName: a character string with the sheet name.

You should only need to provide one of the two.

You might ask "how do I know how many sheets there are in a worksheet?" (for sheetIndex) or even "what are the sheet names?" (for sheetName). ?getSheets to the rescue:

Usage:

     getSheets(wb)

Arguments:

      wb: a workbook object as returned by 'createWorksheet' or
          'loadWorksheet'.

Value:

     'getSheets' returns a list of java object references each pointing
     to an worksheet.  The list is named with the sheet names.

You'll need to use loadWorkbook(file) instead of read.xlsx in order to get the sheet names, but a little reading of the manuals will provide you the information you need to switch over. (You can use something like getSheets(loadWorkbook(file)), but in my experience I try to avoid opening the same file multiple times in the same script, regardless of auto-closing.)

As an alternative, Hadley's readxl package shows promise in its simplicity, speed, and stability. It has excel_sheets() and read_excel() that should fill your needs. (In fact, that's all it has ... simplicity is "A Good Thing (tm)".)

Edit:

library(XLConnect)
## Loading required package: XLConnectJars
## XLConnect 0.2-11 by Mirai Solutions GmbH [aut],
##   Martin Studer [cre],
##   The Apache Software Foundation [ctb, cph] (Apache POI, Apache Commons
##     Codec),
##   Stephen Colebourne [ctb, cph] (Joda-Time Java library)
## http://www.mirai-solutions.com ,
## http://miraisolutions.wordpress.com
## Attaching package: 'XLConnect'
## The following objects are masked from 'package:xlsx':
##     createFreezePane, createSheet, createSplitPane, getCellStyle, getSheets, loadWorkbook, removeSheet, saveWorkbook, setCellStyle, setColumnWidth, setRowHeight

wb1 <- loadWorkbook('Book1.xlsx')
shts1 <- getSheets(wb1)
shts1
## [1] "Orig"   "Sheet2" "Sheet8" "Sheet3" "Sheet4" "Sheet5" "Sheet6" "Sheet7"
for (ws in shts1) {
    message(ws)                              # just announcing myself
    dat <- readWorksheet(wb1, ws)
    message(paste(dim(dat), collapse=' x ')) # do something meaningful, not this
}
## Orig
## 128 x 11
## Sheet2
## 128 x 11
## Sheet8
## 128 x 19
## Sheet3
## 17 x 11
## Sheet4
## 128 x 11
## Sheet5
## 128 x 11
## Sheet6
## 128 x 11
## Sheet7
## 128 x 11

Edit #2:

As a more detailed iterative example:

library(XLConnect)
for (fn in list.files(pattern="*.xlsx")) {
    message('Opening: ', fn)
    wb <- loadWorkbook(fn)
    shts <- getSheets(wb)
    message(sprintf('    %d Sheets: %s', length(shts),
                    paste(shts, collapse=', ')))
    for (sh in shts) {
        dat <- readWorksheet(wb, sh)
        ## do something meaningful with the data
    }        
}

I'm not really sure what you are doing with your code (since you never said what was contained within any of the spreadsheets), but an alternative approach (that I would use in place of the previous double-for example) is to enclose everything within lists:

dat <- sapply(list.files(pattern='*.xlsx'), function(fn) {
    wb <- loadWorkbook(fn)
    sapply(getSheets(wb), function(sh) readWorksheet(wb, sh))
})

str(dat, list.len=2)
## List of 4
##  $ Book1.xlsx:List of 8
##   ..$ Orig  :'data.frame':   128 obs. of  11 variables:
##   .. ..$ i   : num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ x   : num [1:128] 1606527 7484 437881 1601729 1341668 ...
##   .. .. [list output truncated]
##   ..$ Sheet2:'data.frame':   128 obs. of  11 variables:
##   .. ..$ i   : num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ x   : num [1:128] 1606527 7484 437881 1601729 1341668 ...
##   .. .. [list output truncated]
##   .. [list output truncated]
##  $ Book2.xlsx:List of 8
##   ..$ Orig  :'data.frame':   128 obs. of  11 variables:
##   .. ..$ i   : num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ x   : num [1:128] 1606527 7484 437881 1601729 1341668 ...
##   .. .. [list output truncated]
##   ..$ Sheet2:'data.frame':   128 obs. of  11 variables:
##   .. ..$ i   : num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ x   : num [1:128] 1606527 7484 437881 1601729 1341668 ...
##   .. .. [list output truncated]
##   .. [list output truncated]
##   [list output truncated]

If you don't care about differentiating from which workbook a particular sheet came from -- and subsequently simplify processing the data -- then you can "flatten" the nested lists into a single list:

flatdat <- unlist(dat, recur=FALSE)
str(flatdat, list.len=3)
## List of 555
##  $ Book1.xlsx.Orig  :'data.frame':   128 obs. of  11 variables:
##   ..$ i   : num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ x   : num [1:128] 1606527 7484 437881 1601729 1341668 ...
##   ..$ c1  : num [1:128] 1 1 1 1 1 1 1 1 1 1 ...
##   .. [list output truncated]
##  $ Book1.xlsx.Sheet2:'data.frame':   128 obs. of  11 variables:
##   ..$ i   : num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ x   : num [1:128] 1606527 7484 437881 1601729 1341668 ...
##   ..$ c1  : num [1:128] 1 1 1 1 1 1 1 1 1 1 ...
##   .. [list output truncated]
##  $ Book1.xlsx.Sheet8:'data.frame':   128 obs. of  19 variables:
##   ..$ i    : num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ x    : num [1:128] 1606527 7484 437881 1601729 1341668 ...
##   ..$ c1   : num [1:128] 1 1 1 1 1 1 1 1 1 1 ...
##   .. [list output truncated]
##   [list output truncated]

Now, processing your data is perhaps simpler. Your code for looking for "P"s is a bit flawed in that you are assigning a data.frame to a cell within another data.frame, typically frowned upon.

This will likely turn in to another question for you. For that, I strongly urge you to provide a better detailed question, including what a sample worksheet looks like and what you expect your output to look like.

这篇关于如何写R来循环设置目录中每个文件的每个工作表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆