R: Is there a way to subset a file while reading
Problem Description
I have a huge .csv file; its size is ~1.4 GB and reading it with read.csv takes time. There are several variables in that file, and all I want is to extract data for a few variables in a certain column.
For example, suppose ABC.csv is my file and it looks something like this:
ABC.csv
Date Variables Val
2017-11-01 X 23
2017-11-01 A 2
2017-11-01 B 0.5
............................
2017-11-02 X 20
2017-11-02 C 40
............................
2017-11-03 D 33
2017-11-03 X 22
............................
............................
So, here the variable of interest is X, and while reading this file I want df$Variables to be scanned so that only the rows with the string X in this column are read. My new data frame will then look something like this:
> df
Date Variables Val
2017-11-01 X 23
2017-11-02 X 20
.........................
.........................
Any help will be appreciated. Thank you in advance.
Recommended Answer
Check out the LaF package; it allows reading very large text files in blocks, so you don't have to read the entire file into memory.
library(LaF)
data_model <- detect_dm_csv("yourFile.csv", skip = 1)  # detects the file structure
dat <- laf_open(data_model)                            # opens a connection to the file
block_list <- lapply(seq(1, 100000, 1000), function(row_num) {
  goto(dat, row_num)
  data_block <- next_block(dat, nrows = 1000)          # reads a block of 1000 rows
  data_block[data_block$Variables == "X", ]            # keep only the rows for variable X
})
your_df <- do.call("rbind", block_list)
Admittedly, the package sometimes feels a bit bulky, and in some situations I had to find small hacks to get my results (you might have to adapt my solution to your data). Nevertheless, I found it an immensely useful solution for dealing with files that exceeded my RAM.
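The same chunked read-and-filter idea can also be done in base R, without extra packages, by reading the CSV through an open connection in fixed-size blocks. A minimal sketch (the demo file, column names, and chunk size are assumptions taken from the question's example, not from the answer above):

```r
# Base-R sketch: stream the CSV through a connection in fixed-size
# chunks, keeping only rows whose Variables column equals "X".
# A small demo file shaped like the ABC.csv in the question is
# created first so the example is self-contained.
writeLines(c("Date,Variables,Val",
             "2017-11-01,X,23",
             "2017-11-01,A,2",
             "2017-11-01,B,0.5",
             "2017-11-02,X,20",
             "2017-11-02,C,40"), "ABC.csv")

con <- file("ABC.csv", open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]  # read column names once
chunk_size <- 2  # use e.g. 10000 for a real 1.4 GB file
chunks <- list()
repeat {
  block <- tryCatch(
    read.csv(con, header = FALSE, col.names = header,
             nrows = chunk_size, stringsAsFactors = FALSE),
    error = function(e) NULL  # read.csv errors once no lines remain
  )
  if (is.null(block) || nrow(block) == 0) break
  chunks[[length(chunks) + 1]] <- block[block$Variables == "X", ]
  if (nrow(block) < chunk_size) break  # last (short) chunk reached
}
close(con)
df <- do.call(rbind, chunks)
```

Because read.csv is called on an already-open connection, each call continues where the previous one stopped, so only chunk_size rows are in memory at a time; the filtered pieces are small and can be bound together at the end.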