Is there a way to load multiple text files into a single dataframe using Databricks?


Question

I am trying to test a few ideas to recursively loop through all files in a folder and sub-folders, and load everything into a single dataframe. I have 12 different kinds of files, and the differences are based on the file naming conventions. So, I have file names that start with 'ABC', file names that start with 'CN', file names that start with 'CZ', and so on. I tried the following 3 ideas.

import pyspark
import os.path
from pyspark.sql import SQLContext
from pyspark.sql.functions import input_file_name

# Idea 1: the DataFrame reader ("text" is the built-in format name,
# and DBFS paths need a leading slash).
df = sqlContext.read.format("text").option("header", "false").load("/mnt/rawdata/2019/06/28/Parent/ABC*.gz")
df = df.withColumn('input', input_file_name())
print(df)

# Idea 2: an RDD of lines.
df = sc.textFile('/mnt/rawdata/2019/06/28/Parent/ABC*.gz')
print(df)

# Idea 3: the sequence-file reader (these are gzipped text files, not
# sequence files, so this read is unlikely to succeed).
df = sc.sequenceFile('/mnt/rawdata/2019/06/28/Parent/ABC*.gz').toDF()
df = df.withColumn('input', input_file_name())
print(df)

This can be done with PySpark or PySpark SQL. I just need to get everything loaded, from a data lake, into a dataframe so I can push the dataframe into Azure SQL Server. I'm doing all coding in Azure Databricks. If this was regular Python, I could do it pretty easily. I just don't know PySpark well enough to get this working.
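As a sketch of what that single-pass load might look like (assuming the mount point from the attempts above and the pipe-delimited layout shown in the samples below; not verified on a real cluster):

```python
# Hypothetical sketch: build one reader call that pulls every file
# matching a name prefix into a single DataFrame. The path, prefix,
# and options are assumptions based on the samples in the question.
def load_prefix(spark, base_path, prefix):
    """Read all <prefix>*.gz files under base_path in one pass."""
    return (spark.read
            .option("sep", "|")         # rows are pipe-delimited
            .option("header", "false")  # the samples show no header row
            .csv(f"{base_path}/{prefix}*.gz"))
```

On a cluster, `load_prefix(spark, '/mnt/rawdata/2019/06/28/Parent', 'ABC')` would return one DataFrame covering every ABC file, and `input_file_name()` could then tag each row with its source file.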

Just to illustrate the point, I have 3 zipped files that look like this (ABC0006.gz, ABC00015.gz, and ABC0022.gz):

ABC0006.gz
0x0000fa00|ABC|T3|1995
0x00102c55|ABC|K2|2017
0x00024600|ABC|V0|1993

ABC00015.gz
0x00102c54|ABC|G1|2016
0x00102cac|ABC|S4|2017
0x00038600|ABC|F6|2003

ABC0022.gz
0x00102c57|ABC|J0|2017
0x0000fa00|ABC|J6|1994
0x00102cec|ABC|V2|2017

I want to merge everything into one dataframe that looks like this (the .gz is the name of the file; each file has exactly the same headers):

0x0000fa00|ABC|T3|1995
0x00102c55|ABC|K2|2017
0x00024600|ABC|V0|1993
0x00102c54|ABC|G1|2016
0x00102cac|ABC|S4|2017
0x00038600|ABC|F6|2003
0x00102c57|ABC|J0|2017
0x0000fa00|ABC|J6|1994
0x00102cec|ABC|V2|2017

I've got 1000s of these files to get through. Fortunately, there are just 12 distinct types of files and thus 12 types of names...starting with 'ABC', 'CN', 'CZ', etc. Thanks for the look here.
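Since all 12 types share the same row layout, one way to cover them is a glob pattern per prefix, handed to a single read as a list of paths (a sketch; only the three prefixes named above come from the question, the rest would be filled in):

```python
# Build one wildcard pattern per file-name prefix. Only 'ABC', 'CN',
# and 'CZ' appear in the question; the remaining nine are placeholders.
PREFIXES = ["ABC", "CN", "CZ"]

def glob_patterns(base_path, prefixes=PREFIXES):
    """One <prefix>*.gz glob per prefix, rooted at base_path."""
    return [f"{base_path}/{p}*.gz" for p in prefixes]
```

`spark.read.csv` (and `spark.read.load`) accept a list of paths, so the whole list can go into one call and every matching file lands in the same DataFrame.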

Based on your comments, Abraham, it seems like my code should look like this, right...

file_list=[]
path = 'dbfs/rawdata/2019/06/28/Parent/'
files  = dbutils.fs.ls(path)
for file in files:
    if(file.name.startswith('ABC')):
       file_list.append(file.name)
df = spark.read.load(path=file_list)

Is this correct, or is this not correct? Please advise. I think we are close, but this still doesn't work for me, or I wouldn't be re-posting here. Thanks!!

Answer

PySpark supports loading a list of files via the load function. I believe this is what you are looking for:

file_list = []
path = '/mnt/rawdata/2019/06/28/Parent/'
files = dbutils.fs.ls(path)
for file in files:
    if file.name.startswith('ABC'):
        file_list.append(file.path)  # use the full path, not just the file name
# note: load defaults to the Parquet format; pass format=... for text/CSV files
df = spark.read.load(path=file_list)

If the files are CSVs with a header, use the command below:

df = spark.read.load(path=file_list,format="csv", sep=",", inferSchema="true", header="true")
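One caveat: the sample rows in the question are pipe-delimited and headerless, so for those files the separator and header options would change (a sketch along the same lines, not verified on a cluster):

```python
def load_pipe_delimited(spark, paths):
    """Read pipe-delimited, headerless files into a DataFrame; mirrors
    the CSV reader above but with sep='|' and header='false' (a sketch)."""
    return spark.read.load(path=paths, format="csv", sep="|",
                           inferSchema="true", header="false")
```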

For more example code, refer to https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html
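For the final step in the question, pushing the DataFrame into Azure SQL Server, a JDBC write is one route. The sketch below assumes the standard Microsoft SQL Server JDBC driver; the server, database, table, and credentials are all placeholders:

```python
def azure_sql_options(server, database, table, user, password):
    """JDBC options for Azure SQL Server (all argument values are
    placeholders to be replaced with real connection details)."""
    return {
        "url": f"jdbc:sqlserver://{server}:1433;databaseName={database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }

def push_to_sql(df, options):
    """Append the DataFrame to the target table over JDBC (a sketch)."""
    df.write.format("jdbc").options(**options).mode("append").save()
```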
