How to deal with a 50 GB CSV file in R?


Problem description

I am relatively new to "large data" processing in R, and I am hoping for some advice on how to deal with a 50 GB CSV file. The current problem is as follows:

The table looks like this:

ID,Address,City,States,... (50 more fields of characteristics of a house)
1,1,1st street,Chicago,IL,...
# the first 1 comes from write.csv, which wrote a row-index column into the file

I would like to find all rows that belong to San Francisco, CA. It is supposed to be an easy problem, but the CSV file is too large.

I know of two ways to do this in R, plus a third way that uses a database to handle it:

(1) Using R's ff package (read.csv.ffdf):

The file was last saved with write.csv, so its columns contain all different types.

library(ff)

all <- read.csv.ffdf(
  file = "<path of large file>",
  sep = ",",
  header = TRUE,
  VERBOSE = TRUE,
  first.rows = 10000,
  next.rows = 50000
)

The console gives me this:

Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, :
  vmode 'character' not implemented

Searching online, I found several answers, but none fit my case, and I can't really make sense of how to convert "character" columns into the "factor" type as they suggest.
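
For what it's worth, my current guess at what those answers mean is to force the character columns to be read as factors via colClasses, which read.csv.ffdf passes through to read.table. A rough sketch only; the real file has 50+ fields whose individual types would need to be spelled out instead of the blanket "factor" used here:

library(ff)

# Sketch: ff can store factors where it cannot store raw character
# vectors, so character columns are declared as "factor" up front.
# colClasses is recycled, so a single "factor" applies to every
# column; a real call would list each of the 50+ fields' types.
all <- read.csv.ffdf(
  file = "<path of large file>",
  header = TRUE,
  colClasses = "factor",
  first.rows = 10000,
  next.rows = 50000
)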

I then tried read.table.ffdf, which was even more of a disaster; I couldn't find a solid guide for it.

(2) Using R's readLines:

I know this is another good way, but I can't find an efficient way to do it.
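
The chunked approach I have in mind looks roughly like the sketch below. It is untested on the real file, assumes the City and States columns from the header shown above, and uses a cheap textual pre-filter before parsing:

con <- file("<path of large file>", open = "r")
header <- readLines(con, n = 1)

keep <- character(0)
repeat {
  # stream the file 50,000 lines at a time instead of all at once
  lines <- readLines(con, n = 50000)
  if (length(lines) == 0) break
  # cheap textual pre-filter; rows are checked properly after parsing
  keep <- c(keep, grep("San Francisco", lines, value = TRUE, fixed = TRUE))
}
close(con)

# parse only the surviving lines, then confirm the actual columns
# (note: the index column added by write.csv may offset the header)
sf <- read.csv(text = c(header, keep), stringsAsFactors = FALSE)
sf <- sf[sf$City == "San Francisco" & sf$States == "CA", ]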

(3) Using SQL:

I am not sure how to transfer the file into an SQL database, or how to handle it from there (if there is a good guide, I would like to try it). But in general, I would like to stick with R.

Thanks for the replies and help!

Recommended answer

You can use R with SQLite behind the curtains via the sqldf package. Use the read.csv.sql function in the sqldf package, and then you can query the data however you want to obtain the smaller data frame.

Example from the documentation:

library(sqldf)

iris2 <- read.csv.sql("iris.csv", 
    sql = "select * from file where Species = 'setosa' ")

I've used this library on VERY large CSV files with good results.
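
Adapted to the question's file, a sketch might look like the following. The City and States column names are taken from the header shown in the question, and the path is a placeholder:

library(sqldf)

# Sketch: SQLite does the filtering, so only the matching rows
# are ever materialized as an R data frame.
sf <- read.csv.sql("<path of large file>",
    sql = "select * from file
           where City = 'San Francisco' and States = 'CA'")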
