Importing multiple files in sparklyr

Problem description

I'm very new to sparklyr and Spark, so please let me know if this is not the "Spark" way to do this.

I have 50+ .txt files at around 300 MB each, all in the same folder (call it x), that I need to import into sparklyr, preferably as one table.

I can read them individually like this:

spark_read_csv(path=x, sc=sc, name="mydata", delimiter = "|", header=FALSE)
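
For context, spark_read_csv returns a dplyr-compatible table reference rather than an in-memory data frame; a minimal usage sketch (the file name below is hypothetical, and sc is assumed to be an existing spark_connect() connection):

library(sparklyr)
library(dplyr)

# Hypothetical single file; sc is an existing spark_connect() connection
one_file <- spark_read_csv(path = "file01.txt", sc = sc, name = "one_file",
                           delimiter = "|", header = FALSE)

one_file %>% head(5) %>% collect()  # pull a few rows back into R for inspection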

If I were to import them all outside of sparklyr, I would probably create a list with the file names (call it filelist) and then import them all into a list with lapply:

filelist = list.files(pattern = "\\.txt$")  # match files ending in .txt
datalist = lapply(filelist, function(x) read.table(file = x, sep = "|", header = FALSE))

This gives me a list where element k is the k-th .txt file in filelist. So my question is: is there an equivalent way in sparklyr to do this?

I've tried to use lapply() and spark_read_csv, like I did above outside sparklyr, just changing read.table to spark_read_csv and the arguments:

datalist = lapply(filelist, function(x) spark_read_csv(path = x, sc = sc, name = "name", delimiter = "|", header = FALSE))

which gives me a list with the same number of elements as .txt files, but every element (.txt file) is identical to the last .txt file in the file list.

> identical(datalist[[1]],datalist[[2]])
[1] TRUE

I obviously want each element to be one of the datasets. My idea is that after this, I can just rbind them together.

Found a way. The problem was that the argument "name" in spark_read_csv needs to be updated each time a new file is read, otherwise it gets overwritten. So I used a for loop instead of lapply, and in each iteration I change the name. Are there better ways?

datalist <- list()
for (i in seq_along(filelist)) {
  name <- paste("dataset", i, sep = "_")  # unique table name per file
  datalist[[i]] <- spark_read_csv(path = filelist[i], sc = sc,
                                  name = name, delimiter = "|", header = FALSE)
}
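
The resulting list can then be combined into a single Spark table; a minimal sketch, assuming every file shares the same columns (sdf_bind_rows is sparklyr's row-binding helper; Reduce with dplyr::union_all would be a more conservative alternative):

library(sparklyr)
library(dplyr)

# Combine the per-file Spark tables into one table
# (assumes all files have the same columns in the same order)
alldata <- sdf_bind_rows(datalist)

# Alternative using only dplyr verbs:
# alldata <- Reduce(union_all, datalist)

sdf_nrow(alldata)  # sanity check: total row count across all files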

Recommended answer

Since you (emphasis mine)

have 50+ .txt files at around 300 MB each, all in the same folder

you can just use a wildcard in the path:

spark_read_csv(
  path = "/path/to/folder/*.txt",
  sc = sc, name = "mydata", delimiter = "|", header=FALSE) 

If the directory contains only the data, you can simplify this even further:

spark_read_csv(
  path = "/path/to/folder/",
  sc = sc, name = "mydata", delimiter = "|", header = FALSE)
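
For completeness, a minimal end-to-end sketch of the wildcard approach, assuming a local connection and that /path/to/folder is reachable from the Spark workers (the connection setup is an assumption, not part of the original answer):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Read every pipe-delimited .txt file in the folder into one Spark table
mydata <- spark_read_csv(
  path = "/path/to/folder/*.txt",
  sc = sc, name = "mydata", delimiter = "|", header = FALSE)

sdf_nrow(mydata)                  # total row count across all files
mydata %>% head(3) %>% collect()  # inspect a few rows

spark_disconnect(sc)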

Native Spark readers also support reading multiple paths at once (Scala code):

spark.read.csv("/some/path", "/other/path")

but as of sparklyr 0.7.0-9014 it is not properly implemented (the current implementation of spark_normalize_path doesn't support vectors of size larger than one).
