Importing multiple files in sparklyr
Question
I'm very new to sparklyr and Spark, so please let me know if this is not the "Spark" way to do this.
I have 50+ .txt files of around 300 MB each, all in the same folder, call it x, that I need to import into sparklyr, preferably as one table.
I can read them individually like this:
spark_read_csv(path = x, sc = sc, name = "mydata", delimiter = "|", header = FALSE)
If I were to import them all outside of sparklyr, I would probably create a list with the file names, call it filelist, and then import them all into a list with lapply:
filelist = list.files(pattern = "\\.txt$")  # anchor the regex so only files ending in .txt match
datalist = lapply(filelist, function(x) read.table(file = x, sep = "|", header = FALSE))
This gives me a list where element k is the k-th .txt file in filelist. So my question is: is there an equivalent way in sparklyr to do this?
I've tried to use lapply() and spark_read_csv, like I did above outside sparklyr, just changing read.table to spark_read_csv and adjusting the arguments:
# each call registers the same Spark table name ("name"), so every list element points at the last file read
datalist = lapply(filelist, function(x) spark_read_csv(path = x, sc = sc, name = "name", delimiter = "|", header = FALSE))
which gives me a list with the same number of elements as there are .txt files, but every element (.txt file) is identical to the last .txt file in the file list:
> identical(datalist[[1]],datalist[[2]])
[1] TRUE
I obviously want each element to be one of the datasets. My idea is that after this, I can just rbind them together.
Found a way. The problem was that the name argument in spark_read_csv needs to be updated each time a new file is read, otherwise it overwrites the previous table. So I used a for loop instead of lapply and changed the name in each iteration. Are there better ways?
datalist <- list()
for (i in seq_along(filelist)) {
  name <- paste("dataset", i, sep = "_")  # unique Spark table name per file
  datalist[[i]] <- spark_read_csv(path = filelist[i], sc = sc,
                                  name = name, delimiter = "|", header = FALSE)
}
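The same pattern works without an explicit for loop; a minimal sketch, assuming each file should simply get an index-based table name:

# give every file its own Spark table name so nothing is overwritten
datalist <- lapply(seq_along(filelist), function(i) {
  spark_read_csv(path = filelist[i], sc = sc,
                 name = paste0("dataset_", i),
                 delimiter = "|", header = FALSE)
})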
Answer
Since you (emphasis mine)

have 50+ .txt files at around 300 MB each, all in the same folder

you can just use a wildcard in the path:
spark_read_csv(
  path = "/path/to/folder/*.txt",
  sc = sc, name = "mydata", delimiter = "|", header = FALSE)
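Spark expands the glob itself, so all files matching *.txt are read into a single distributed DataFrame registered as mydata; no manual rbind step should be needed afterwards.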
If the directory contains only the data files, you can simplify this even further:
spark_read_csv(
  path = "/path/to/folder/",
  sc = sc, name = "mydata", delimiter = "|", header = FALSE)
Native Spark readers also support reading multiple paths at once (Scala code):
spark.read.csv("/some/path", "/other/path")
but as of 0.7.0-9014 this is not properly implemented in sparklyr (the current implementation of spark_normalize_path doesn't support vectors of length greater than one).
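Until multi-path reads land in sparklyr, one workaround is to read each path separately and stack the results on the Spark side; a hedged sketch, assuming your sparklyr version provides sdf_bind_rows:

# read each path into its own Spark table, then union them into one DataFrame
paths <- c("/some/path", "/other/path")
parts <- lapply(seq_along(paths), function(i) {
  spark_read_csv(path = paths[i], sc = sc,
                 name = paste0("part_", i),
                 delimiter = "|", header = FALSE)
})
combined <- do.call(sdf_bind_rows, parts)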