将 xlsx 文件读入 R 的快速方法 [英] Fast way to read xlsx files into R
问题描述
这是
options(scipen=999) # 无科学数字格式nn <- c(1, 10, 100, 1000, 5000, 10000, 20000, 30000)pp <- c(1, 5, 10, 20, 30, 40, 50)# 创建一些excel文件l <- list() # 保存结果tmp_dir <- tempdir()for (n in nn) {for (p in pp) {姓名 <-cat("\n\tn:", n, "p:", p)冲洗控制台()m <- 矩阵(rnorm(n*p), n, p)文件 <- paste0(tmp_dir, "/n", n, "_p", p, ".xlsx")# 写write.xlsx(m, 文件)# 读elapsed <- system.time( x <- openxlsx::read.xlsx(file) )["elapsed"]df <- data.frame(fun = "openxlsx::read.xlsx", n = n, p = p,elapsed = elapsed, stringsAsFactors = F, row.names = NULL)l <- 追加(l,列表(df))elapsed <- system.time( x <- readxl::read_xlsx(file) )["elapsed"]df <- data.frame(fun = "readxl::read_xlsx", n = n, p = p,elapsed = elapsed, stringsAsFactors = F, row.names = NULL)l <- 追加(l,列表(df))}}# 结果d <- do.call(rbind, l)图书馆(ggplot2)ggplot(d, aes(n, elapsed, color= fun)) +geom_line() + geom_point() +facet_wrap( ~ paste("columns:", p)) +xlab("行数") +ylab("秒")
It is a follow-up question to this one. What is the fastest way to read .xlsx
files into R ?
I use library(xlsx)
to read in data from 36 .xlsx
files. It works. However, the problem is that this is very time consuming (well over 30 minutes), especially when considering the data in each file is not that large (matrix of size 3*3652 in each file). To this end, is there a better to deal with such problem, please? Is there another quick way to read .xlsx
into R? Or can I put the 36 files into a single csv file quickly and then read into R?
Moreover, I just realised that readxl
cannot write xlsx. Is there a counterpart of it to deal with writing instead of reading?
"Response to those voted this question down":
This question is about fact instead of the so-called "opinionated answers and spam" because speed is time and time is fact but NOT opinion.
Further update:
Perhaps one can explain to us in plain language why some method works much faster than others. I am certainly confused about this.
Here is a small benchmark test. Results: readxl::read_xlsx
on average about twice as fast as openxlsx::read.xlsx
across different number of rows (n
) and columns (p
) using standard settings.
options(scipen=999) # no scientific number format
nn <- c(1, 10, 100, 1000, 5000, 10000, 20000, 30000)
pp <- c(1, 5, 10, 20, 30, 40, 50)
# create some excel files
l <- list() # save results
tmp_dir <- tempdir()
for (n in nn) {
for (p in pp) {
name <-
cat("\n\tn:", n, "p:", p)
flush.console()
m <- matrix(rnorm(n*p), n, p)
file <- paste0(tmp_dir, "/n", n, "_p", p, ".xlsx")
# write
write.xlsx(m, file)
# read
elapsed <- system.time( x <- openxlsx::read.xlsx(file) )["elapsed"]
df <- data.frame(fun = "openxlsx::read.xlsx", n = n, p = p,
elapsed = elapsed, stringsAsFactors = F, row.names = NULL)
l <- append(l, list(df))
elapsed <- system.time( x <- readxl::read_xlsx(file) )["elapsed"]
df <- data.frame(fun = "readxl::read_xlsx", n = n, p = p,
elapsed = elapsed, stringsAsFactors = F, row.names = NULL)
l <- append(l, list(df))
}
}
# results
d <- do.call(rbind, l)
library(ggplot2)
ggplot(d, aes(n, elapsed, color= fun)) +
geom_line() + geom_point() +
facet_wrap( ~ paste("columns:", p)) +
xlab("Number of rows") +
ylab("Seconds")
这篇关于将 xlsx 文件读入 R 的快速方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!