Fast way to read xlsx files into R

Question

This is a follow-up to an earlier question. What is the fastest way to read .xlsx files into R?

I use library(xlsx) to read data from 36 .xlsx files. It works, but it is very time consuming (well over 30 minutes), even though the data in each file is not that large (a matrix of size 3*3652 per file). Is there a better way to deal with this problem? Is there a quicker way to read .xlsx files into R? Or can I quickly combine the 36 files into a single csv file and then read that into R?

Moreover, I just realised that readxl cannot write xlsx files. Is there a counterpart that handles writing instead of reading?

Response to those who voted this question down:

This question is about fact rather than the so-called "opinionated answers and spam": speed is time, and time is a fact, NOT an opinion.

Further update:

Perhaps someone can explain in plain language why some methods work much faster than others. I am certainly confused about this.

Solution

Here is a small benchmark. Result: with default settings, readxl::read_xlsx is on average about twice as fast as openxlsx::read.xlsx across different numbers of rows (n) and columns (p).

options(scipen=999)  # no scientific number format

nn <- c(1, 10, 100, 1000, 5000, 10000, 20000, 30000)
pp <- c(1, 5, 10, 20, 30, 40, 50)

# create some excel files
l <- list()  # save results
tmp_dir <- tempdir()

for (n in nn) {
  for (p in pp) {
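    # progress message for the current combination of rows and columns
    cat("\n\tn:", n, "p:", p)
    flush.console()
    m <- matrix(rnorm(n*p), n, p)
    file <- paste0(tmp_dir, "/n", n, "_p", p, ".xlsx")

    # write (write.xlsx is assumed to be the openxlsx version;
    # the matrix is converted to a data frame first)
    openxlsx::write.xlsx(as.data.frame(m), file)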

    # read
    # time openxlsx
    elapsed <- system.time( x <- openxlsx::read.xlsx(file) )["elapsed"]
    df <- data.frame(fun = "openxlsx::read.xlsx", n = n, p = p, 
                     elapsed = elapsed, stringsAsFactors = FALSE, row.names = NULL)
    l <- append(l, list(df))

    # time readxl on the same file
    elapsed <- system.time( x <- readxl::read_xlsx(file) )["elapsed"]
    df <- data.frame(fun = "readxl::read_xlsx", n = n, p = p, 
                     elapsed = elapsed, stringsAsFactors = FALSE, row.names = NULL)
    l <- append(l, list(df))

  }
}

# results 
d <- do.call(rbind, l)

library(ggplot2)

ggplot(d, aes(n, elapsed, color= fun)) + 
  geom_line() + geom_point() +  
  facet_wrap( ~ paste("columns:", p)) +
  xlab("Number of rows") +
  ylab("Seconds")
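
Applied to the situation in the question (a few dozen small files in one folder), a minimal sketch using the faster reader could look like this. It is only an illustration under some assumptions: all files sit in a single directory, the data is on the first sheet, the first row holds column names, and every file has the same columns. The helper name read_all_xlsx and the source_file column are made up for this example.

```r
library(readxl)

# Hypothetical helper (not part of readxl): read the first sheet of every
# .xlsx file in `dir` and stack the results into one data frame.
# Assumes all files share the same column names and types.
read_all_xlsx <- function(dir) {
  files <- list.files(dir, pattern = "\\.xlsx$", full.names = TRUE)
  tables <- lapply(files, function(f) {
    df <- as.data.frame(readxl::read_xlsx(f))
    # remember which file each row came from
    df$source_file <- rep(basename(f), nrow(df))
    df
  })
  # returns NULL if the directory contains no .xlsx files
  do.call(rbind, tables)
}
```

Usage would be along the lines of combined <- read_all_xlsx("path/to/folder"), followed by write.csv(combined, "combined.csv", row.names = FALSE) if a single csv file is wanted. For writing xlsx files, openxlsx also provides write.xlsx.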
