Get column types of Excel sheet automatically
Problem description
I have an Excel file with several sheets, each with several columns, so I would like the column types to be determined automatically rather than specifying them one by one. I want the sheets read the way read.csv with stringsAsFactors = FALSE would read them, because that interprets the column types correctly. With my current method, a column containing "0.492 ± 0.6" is interpreted as numeric and returns NA, because the stringsAsFactors option is not available in read_excel. Below is a workaround that works more or less well, but that I cannot use in real life because I am not allowed to create a new file. Note: I need the other columns as numeric or integer, and the columns that contain only text as character, as stringsAsFactors = FALSE does in my read.csv example.
library(readxl)

file <- "myfile.xlsx"
firstread <- read_excel(file, sheet = "mysheet", col_names = TRUE, na = "", skip = 0)
# firstread has the problem that the column containing "0.492 ± 0.6"
# is interpreted as numeric (returns NA)
colna <- colnames(firstread)

# read every column as character
colnumt <- ncol(firstread)
textcol <- rep("text", colnumt)
secondreadchar <- read_excel(file, sheet = "mysheet", col_names = TRUE,
                             col_types = textcol, na = "", skip = 0)
# another column, with the number 0.532, is now 0.5319999999999999,
# and there are several other similar cases

# read again with stringsAsFactors = FALSE
# critical step: in real life, I "cannot" write a csv file
write.csv(secondreadchar, "allcharac.txt", row.names = FALSE)
stringsasfactor <- read.csv("allcharac.txt", stringsAsFactors = FALSE)
colnames(stringsasfactor) <- colna
# the column with "0.492 ± 0.6" is now character, as desired; the others are numeric, as desired
Recommended answer
Here is a script that imports all the data in your Excel file. It puts each sheet's data in a list called dfs:
library(readxl)

# Get all the sheets
all_sheets <- excel_sheets("myfile.xlsx")

# Loop through the sheet names and get the data in each sheet
dfs <- lapply(all_sheets, function(sheet_name) {
  # Get the number of columns in the current sheet
  col_num <- NCOL(read_excel(path = "myfile.xlsx", sheet = sheet_name))

  # Read the sheet again with every column as text
  df <- read_excel(path = "myfile.xlsx", sheet = sheet_name,
                   col_types = rep("text", col_num))

  # Convert the tibble to a data.frame
  df <- as.data.frame(df, stringsAsFactors = FALSE)

  # A column counts as numeric if every non-missing value converts
  # to a number without producing NA. Otherwise it stays as text.
  cond <- vapply(df, function(col) {
    col <- col[!is.na(col)]
    all(!is.na(suppressWarnings(as.numeric(col))))
  }, logical(1))
  numeric_cols <- names(df)[cond]

  # Convert those columns to numeric
  df[numeric_cols] <- lapply(df[numeric_cols], as.numeric)

  # Return df in the desired format
  df
})

# Just for convenience, in order to remember
# which sheet is associated with which data frame
names(dfs) <- all_sheets
The process is as follows:
First, you get all the sheets in the file with excel_sheets and then loop through the sheet names to create data frames. For each data frame, you initially import the data as text by setting the col_types parameter to text. Once the columns have been read as text, you convert the structure from a tibble to a data.frame. After that, you find the columns that are actually numeric and convert them to numeric values.
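The same read-as-text-then-convert idea can also be sketched with base R's type.convert, which is the function read.csv relies on for its own type detection. This is only an illustration on made-up data, not part of the script above: the data frame, its column names, and its values are assumptions standing in for a sheet that was read with col_types set to text.

```r
# Toy data standing in for a sheet read with every column as text.
# Column names and values are hypothetical, chosen to mirror the question.
df <- data.frame(
  id      = c("1", "2", "3"),
  value   = c("0.532", "1.25", NA),
  measure = c("0.492 \u00b1 0.6", "0.3 \u00b1 0.1", "1.1 \u00b1 0.2"),  # \u00b1 is the plus-minus sign
  stringsAsFactors = FALSE
)

# Convert each column in memory, without writing a file.
# as.is = TRUE keeps text columns as character instead of factor.
df[] <- lapply(df, type.convert, as.is = TRUE)

# Inspect the resulting column classes
sapply(df, class)
```

Here id becomes integer, value becomes numeric, and measure stays character, which matches what the round trip through write.csv and read.csv was meant to achieve.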
As of late April, a new version of readxl was released, and the read_excel function gained two enhancements pertinent to this question. The first is that you can have the function guess the column types for you by passing "guess" to the col_types parameter. The second enhancement (a corollary of the first) is that a guess_max parameter was added to read_excel. This new parameter lets you set the number of rows used for guessing the column types. Essentially, what I wrote above can be shortened to the following:
library(readxl)

# Get all the sheets
all_sheets <- excel_sheets("myfile.xlsx")

dfs <- lapply(all_sheets, function(sheetname) {
  suppressWarnings(read_excel(path = "myfile.xlsx",
                              sheet = sheetname,
                              col_types = "guess",
                              guess_max = Inf))
})

# Just for convenience, in order to remember
# which sheet is associated with which data frame
names(dfs) <- all_sheets
I would recommend that you update readxl to the latest version to shorten your script and, as a result, avoid possible annoyances.
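To check which types were actually guessed, something along these lines may help. It is a sketch that assumes a readxl version with col_types = "guess" and guess_max, and it uses the example workbook bundled with readxl (located via readxl_example) purely as a stand-in for myfile.xlsx.

```r
library(readxl)

# readxl_example() returns the path to a workbook shipped with readxl;
# it is used here only as a placeholder for "myfile.xlsx".
path <- readxl_example("datasets.xlsx")

all_sheets <- excel_sheets(path)

# Read every sheet, letting readxl guess the column types from all rows
dfs <- lapply(all_sheets, function(sheetname) {
  read_excel(path, sheet = sheetname, col_types = "guess", guess_max = Inf)
})
names(dfs) <- all_sheets

# For each sheet, list the class of every column
col_classes <- lapply(dfs, function(df) {
  vapply(df, function(col) class(col)[1], character(1))
})
col_classes
```

If a column that should be text comes back as numeric or logical, that is a sign the guess was based on too few rows or on cells that happen to look numeric.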
I hope this helps.