Is there a way to load multiple text files into a single dataframe using Databricks?


Question

I am trying to test a few ideas to recursively loop through all files in a folder and sub-folders, and load everything into a single dataframe. I have 12 different kinds of files, and the differences are based on the file naming conventions. So, I have file names that start with 'ABC', file names that start with 'CN', file names that start with 'CZ', and so on. I tried the following 3 ideas.

import pyspark
import os.path
from pyspark.sql import SQLContext
from pyspark.sql.functions import input_file_name

# Attempt 1: DataFrame reader with a glob pattern, tagging each row with its source file
df = sqlContext.read.format("com.databricks.spark.text").option("header", "false").load("dbfs/mnt/rawdata/2019/06/28/Parent/ABC*.gz")
df = df.withColumn('input', input_file_name())
print(df)

# Attempt 2: RDD text reader with the same glob pattern
df = sc.textFile('/mnt/rawdata/2019/06/28/Parent/ABC*.gz')
print(df)

# Attempt 3: sequence-file reader converted to a DataFrame
df = sc.sequenceFile('dbfs/mnt/rawdata/2019/06/28/Parent/ABC*.gz/').toDF()
df = df.withColumn('input', input_file_name())
print(df)
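
As a side note on the recursive traversal mentioned above: newer Spark versions (3.0 and later) add recursiveFileLookup and pathGlobFilter options to the DataFrame reader, which would presumably handle the folder and sub-folder walk without a manual loop. A minimal sketch, assuming the same mount path as the attempts above:

# Recursively pick up every ABC*.gz under Parent/ and its sub-folders (requires Spark 3.0+).
df = (spark.read.format("text")
          .option("recursiveFileLookup", "true")
          .option("pathGlobFilter", "ABC*.gz")
          .load("dbfs:/mnt/rawdata/2019/06/28/Parent/"))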

This can be done with PySpark or PySpark SQL. I just need to get everything loaded, from a data lake, into a dataframe so I can push the dataframe into Azure SQL Server. I'm doing all coding in Azure Databricks. If this was regular Python, I could do it pretty easily. I just don't know PySpark well enough to get this working.
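
For reference, once everything is in one dataframe, the push into Azure SQL Server can use Spark's built-in JDBC writer. A minimal sketch with placeholder connection details (the server, database, table, user, and password below are all hypothetical):

# Hypothetical connection details -- replace with your own server, database, and credentials.
jdbc_url = "jdbc:sqlserver://<your-server>.database.windows.net:1433;database=<your-db>"

# Append the dataframe's rows to a SQL Server table over JDBC.
(df.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.raw_data")
    .option("user", "<user>")
    .option("password", "<password>")
    .mode("append")
    .save())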

Just to illustrate the point, I have 3 zipped files that look like this (ABC0006.gz, ABC00015.gz, and ABC0022.gz):

ABC0006.gz
0x0000fa00|ABC|T3|1995
0x00102c55|ABC|K2|2017
0x00024600|ABC|V0|1993

ABC00015.gz
0x00102c54|ABC|G1|2016
0x00102cac|ABC|S4|2017
0x00038600|ABC|F6|2003

ABC0022.gz
0x00102c57|ABC|J0|2017
0x0000fa00|ABC|J6|1994
0x00102cec|ABC|V2|2017

I want to merge everything into one dataframe that looks like this (the .gz is the file name; each file has exactly the same headers):

0x0000fa00|ABC|T3|1995
0x00102c55|ABC|K2|2017
0x00024600|ABC|V0|1993
0x00102c54|ABC|G1|2016
0x00102cac|ABC|S4|2017
0x00038600|ABC|F6|2003
0x00102c57|ABC|J0|2017
0x0000fa00|ABC|J6|1994
0x00102cec|ABC|V2|2017

I've got 1000s of these files to get through. Fortunately, there are just 12 distinct types of files, and thus 12 kinds of names, starting with 'ABC', 'CN', 'CZ', etc. Thanks for taking a look.
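
For what it's worth, since Spark's readers accept glob patterns, one way to cover all 12 name prefixes is to read each prefix with its own glob and union the results. A minimal sketch, assuming the mount path from the question; only 3 of the 12 prefixes are shown:

from functools import reduce
from pyspark.sql.functions import input_file_name

base = 'dbfs:/mnt/rawdata/2019/06/28/Parent/'
prefixes = ['ABC', 'CN', 'CZ']  # ...plus the other 9 prefixes

# Read each prefix's .gz files as text, tag every row with its source file, then union them all.
dfs = [spark.read.text(base + p + '*.gz').withColumn('input', input_file_name())
       for p in prefixes]
df = reduce(lambda a, b: a.unionByName(b), dfs)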

Based on your comments, Abraham, it seems like my code should look like this, right...

file_list=[]
path = 'dbfs/rawdata/2019/06/28/Parent/'
files  = dbutils.fs.ls(path)
for file in files:
    if(file.name.startswith('ABC')):
       file_list.append(file.name)
df = spark.read.load(path=file_list)

Is this correct, or is this not correct? Please advise. I think we are close, but this still doesn't work for me, or I wouldn't be re-posting here. Thanks!!

Answer

PySpark supports loading a list of files using the load function. I believe this is what you are looking for:

file_list = []
path = '/mnt/rawdata/2019/06/28/Parent/'  # dbutils.fs.ls expects a DBFS path such as '/mnt/...' or 'dbfs:/mnt/...'
files = dbutils.fs.ls(path)
for file in files:
    if file.name.startswith('ABC'):
        file_list.append(file.path)  # use file.path (the full path); file.name is only the base name
df = spark.read.load(path=file_list)

If the files are CSV and have a header, use the command below:

df = spark.read.load(path=file_list,format="csv", sep=",", inferSchema="true", header="true")
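
Since the sample rows in the question are pipe-delimited rather than comma-delimited, the separator for this particular data would presumably be "|". A sketch of that variant (untested against the actual files), also tagging each row with its source file:

from pyspark.sql.functions import input_file_name

df = (spark.read.load(path=file_list, format="csv", sep="|", inferSchema="true", header="true")
          .withColumn('input', input_file_name()))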

For more example code, see https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html
