R: Is there a way to subset a file while reading?


Question

I have a huge .csv file (about 1.4 GB), and reading it with read.csv takes a long time. The file contains several variables, and all I want is to extract the data for a few of them, identified by the values in a certain column.

For example, suppose ABC.csv is my file and it looks something like this:

   ABC.csv
     Date       Variables   Val
   2017-11-01   X           23  
   2017-11-01   A           2
   2017-11-01   B           0.5
   ............................
   2017-11-02   X           20
   2017-11-02   C           40
   ............................
   2017-11-03   D           33
   2017-11-03   X           22   
   ............................
   ............................

Here the variable of interest is X. While reading the file, I want the Variables column to be scanned so that only the rows containing the string X in that column are read. My new data frame would then look something like this:

 > df 
  Date    Variables   Val
2017-11-01    X       23
2017-11-02    X       20
.........................
......................... 

Any help will be appreciated. Thank you in advance.

Recommended answer

Check out the LaF package. It reads very large text files in blocks, so you don't have to load the entire file into memory.

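library(LaF)

# header = TRUE keeps the column names (Date, Variables, Val) and skips the header row
data_model <- detect_dm_csv("yourFile.csv", header = TRUE) # detects the file structure
dat <- laf_open(data_model) # opens a connection to the file

# 100000 is a placeholder: replace it with the number of rows in your file
block_list <- lapply(seq(1, 100000, 1000), function(row_num) {
    goto(dat, row_num) # jump to the first row of this block
    data_block <- next_block(dat, nrows = 1000) # reads a block of 1000 rows
    data_block[data_block$Variables == "X", ] # keep only the rows of interest
})
your_df <- do.call("rbind", block_list) # combine the filtered blocks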

Admittedly, the package sometimes feels a bit bulky, and in some situations I had to find small hacks to get my results (you might have to adapt my solution to your data). Nevertheless, I found it an immensely useful solution for dealing with files that exceeded my RAM.
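If installing LaF is not an option, the same read-a-block, filter, combine idea can be sketched in base R. This is not the answerer's method, only a minimal illustration: the helper name read_csv_subset and its arguments are hypothetical, and it assumes a comma-separated file with a plain header line and no quoted fields that contain commas or line breaks.

```r
# Hypothetical helper (not part of LaF): filter a CSV chunk by chunk with base R.
read_csv_subset <- function(path, column, value, chunk_size = 1000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Read the header line once to recover the column names
  # (assumes a simple, unquoted, comma-separated header).
  col_names <- strsplit(readLines(con, n = 1), ",", fixed = TRUE)[[1]]
  kept <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break  # end of file reached
    chunk <- read.csv(text = lines, header = FALSE, col.names = col_names,
                      # force the filter column to character so the
                      # comparison behaves the same in every chunk
                      colClasses = setNames("character", column),
                      stringsAsFactors = FALSE)
    # Keep only the matching rows; the rest of the chunk is discarded.
    kept[[length(kept) + 1]] <- chunk[chunk[[column]] == value, , drop = FALSE]
  }
  result <- do.call(rbind, kept)
  if (!is.null(result)) rownames(result) <- NULL
  result
}

# Small demonstration file that mirrors the layout in the question.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Date,Variables,Val",
             "2017-11-01,X,23",
             "2017-11-01,A,2",
             "2017-11-01,B,0.5",
             "2017-11-02,X,20",
             "2017-11-02,C,40",
             "2017-11-03,D,33",
             "2017-11-03,X,22"), tmp)
df <- read_csv_subset(tmp, "Variables", "X", chunk_size = 3)
print(df)
```

Only one chunk plus the rows kept so far are held in memory at any time, so peak memory depends on chunk_size and on how many rows match. A larger chunk_size usually means fewer read calls at the cost of more memory per chunk.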
