Get column types of an Excel sheet automatically

Question

I have an Excel file with several sheets, each with several columns, so I would like the column types to be detected automatically rather than specified one by one. I want the columns read the way read.csv with stringsAsFactors = FALSE reads them, because that interprets each column's type correctly. With my current method, a column containing "0.492 ± 0.6" is guessed to be numeric and returns NA, and the stringsAsFactors option is not available in read_excel. Below is a workaround that works more or less well, but I cannot use it in real life because I am not allowed to create a new file. Note: I need the other columns to come back as numeric or integer, and text-only columns as character, as happens in my read.csv example.

library(readxl)
file <- "myfile.xlsx"

# First read: column types are guessed
firstread <- read_excel(file, sheet = "mysheet", col_names = TRUE, na = "", skip = 0)
# Problem: the column containing "0.492 ± 0.6" is
# interpreted as numeric (returns NA)
colna <- colnames(firstread)

# Second read: every column as character
colnumt <- ncol(firstread)
textcol <- rep("text", colnumt)
secondreadchar <- read_excel(file, sheet = "mysheet", col_names = TRUE,
                             col_types = textcol, na = "", skip = 0)
# Another column, holding the number 0.532, now reads 0.5319999999999999,
# and there are several similar cases.

# Re-read through read.csv with stringsAsFactors = FALSE.
# Critical step: in real life I cannot write a csv file.
write.csv(secondreadchar, "allcharac.txt", row.names = FALSE)
stringsasfactor <- read.csv("allcharac.txt", stringsAsFactors = FALSE)
colnames(stringsasfactor) <- colna
# The column with "0.492 ± 0.6" is now character, as desired,
# and the others are numeric, as desired.


Recommended answer

Here is a script that imports all the data in your Excel file. It puts each sheet's data in a list called dfs:

library(readxl)

# Get all the sheets
all_sheets <- excel_sheets("myfile.xlsx")

# Loop through the sheet names and get the data in each sheet
dfs <- lapply(all_sheets, function(x) {

  # Get the number of columns in the current sheet
  col_num <- NCOL(read_excel(path = "myfile.xlsx", sheet = x))

  # Read the sheet with every column as text
  df <- read_excel(path = "myfile.xlsx", sheet = x, col_types = rep('text', col_num))

  # Convert from tibble to data.frame
  df <- as.data.frame(df, stringsAsFactors = FALSE)

  # Find numeric columns by trying to convert each one to numeric.
  # If any non-missing value becomes NA, the column is not numeric.
  cond <- apply(df, 2, function(column) {
    column <- column[!is.na(column)]
    all(suppressWarnings(!is.na(as.numeric(column))))
  })
  numeric_cols <- names(df)[cond]
  # lapply returns one list element per column, even for a single column
  df[numeric_cols] <- lapply(df[numeric_cols], as.numeric)

  # Return df in desired format
  df
})

# Just for convenience in order to remember
# which sheet is associated with which dataframe
names(dfs) <- all_sheets

The process is as follows:

First, you get all the sheets in the file with excel_sheets and then loop through the sheet names to create data frames. Each sheet is initially imported as text by setting the col_types parameter to text. Once the columns are text, you convert the structure from a tibble to a data.frame. You then find the columns that are actually numeric and convert them to numeric values.
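The numeric-detection step can be tried in isolation on an in-memory data frame. The following is a minimal sketch, not the answer's exact code: convert_numeric_cols is a hypothetical helper name, and sheet_as_text is made-up data standing in for a sheet that was read with every column as text.

```r
# Hypothetical helper: convert a column of an all-text data frame to numeric
# only when every non-missing value in it parses as a number.
convert_numeric_cols <- function(df) {
  is_numeric_col <- vapply(df, function(column) {
    values <- column[!is.na(column)]
    all(!is.na(suppressWarnings(as.numeric(values))))
  }, logical(1))
  df[is_numeric_col] <- lapply(df[is_numeric_col], as.numeric)
  df
}

# Made-up data standing in for a sheet read with col_types = "text"
sheet_as_text <- data.frame(
  measurement = c("0.492 ± 0.6", "0.5", NA),
  value       = c("0.532", "1.25", NA),
  label       = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

converted <- convert_numeric_cols(sheet_as_text)

# Show the resulting class of each column
print(vapply(converted, class, character(1)))
```

In this sketch the measurement column stays character because one of its values fails to parse as a number, while value becomes numeric and label is left untouched.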

As of late April, a new version of readxl has been released, and read_excel gained two enhancements pertinent to this question. The first is that you can have the function guess the column types for you by passing "guess" to the col_types parameter. The second, a corollary of the first, is the new guess_max parameter, which sets the number of rows used for guessing the column types. Essentially, what I wrote above can be shortened to the following:

library(readxl)

# Get all the sheets
all_sheets <- excel_sheets("myfile.xlsx")

dfs <- lapply(all_sheets, function(sheetname) {
    suppressWarnings(read_excel(path = "myfile.xlsx", 
                                sheet = sheetname, 
                                col_types = 'guess', 
                                guess_max = Inf))
})

# Just for convenience in order to remember
# which sheet is associated with which dataframe
names(dfs) <- all_sheets

I would recommend updating readxl to the latest version, which shortens your script and avoids possible annoyances.
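If updating readxl is not possible, the CSV round trip from the question can probably be avoided in memory. read.csv relies on type.convert to decide column types, so applying utils::type.convert with as.is = TRUE to each text column should give a similar result without writing a file. This is a sketch on made-up data (sheet_as_text is hypothetical), not something the original answer proposed:

```r
# Made-up data standing in for a sheet read with every column as text
sheet_as_text <- data.frame(
  measurement = c("0.492 ± 0.6", "0.5", NA),
  value       = c("0.532", "1.25", NA),
  count       = c("1", "2", "3"),
  label       = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Convert each column the way read.csv would, keeping strings as character
# (as.is = TRUE plays the role of stringsAsFactors = FALSE).
typed <- sheet_as_text
typed[] <- lapply(sheet_as_text, utils::type.convert, as.is = TRUE)

str(typed)
```

Unlike the numeric-only check in the first script, type.convert also distinguishes integer from double columns, which matches the question's request for "numbers or integers".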

I hope this helps.
