Read .nc files from Azure Datalake Gen2 in Azure Databricks


Problem Description


Trying to read .nc (netCDF4) files in Azure Databricks.

Never worked with .nc files before.

  1. All the required .nc files are in Azure Datalake Gen2
  2. Mounted the above files into Databricks at "/mnt/eco_dailyRain" (see the mount sketch after the listing below)
  3. Can list the contents of the mount using dbutils.fs.ls("/mnt/eco_dailyRain") OUTPUT:

    Out[76]: [FileInfo(path='dbfs:/mnt/eco_dailyRain/2000.daily_rain.nc', name='2000.daily_rain.nc', size=429390127),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc', name='2001.daily_rain.nc', size=428217143),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2002.daily_rain.nc', name='2002.daily_rain.nc', size=428218181),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2003.daily_rain.nc', name='2003.daily_rain.nc', size=428217139),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2004.daily_rain.nc', name='2004.daily_rain.nc', size=429390143),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2005.daily_rain.nc', name='2005.daily_rain.nc', size=428217137),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2006.daily_rain.nc', name='2006.daily_rain.nc', size=428217127),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2007.daily_rain.nc', name='2007.daily_rain.nc', size=428217143),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2008.daily_rain.nc', name='2008.daily_rain.nc', size=429390137),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2009.daily_rain.nc', name='2009.daily_rain.nc', size=428217127),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2010.daily_rain.nc', name='2010.daily_rain.nc', size=428217134),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2011.daily_rain.nc', name='2011.daily_rain.nc', size=428218181),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2012.daily_rain.nc', name='2012.daily_rain.nc', size=429390127),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2013.daily_rain.nc', name='2013.daily_rain.nc', size=428217143),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2014.daily_rain.nc', name='2014.daily_rain.nc', size=428218104),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2015.daily_rain.nc', name='2015.daily_rain.nc', size=428217134),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2016.daily_rain.nc', name='2016.daily_rain.nc', size=429390127),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2017.daily_rain.nc', name='2017.daily_rain.nc', size=428217223),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2018.daily_rain.nc', name='2018.daily_rain.nc', size=418143765),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2019.daily_rain.nc', name='2019.daily_rain.nc', size=370034113),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/Consignments.parquet', name='Consignments.parquet', size=237709917),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/test.nc', name='test.nc', size=428217137)]
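
For context, a minimal sketch of how such an ADLS Gen2 mount is typically created with OAuth credentials; the storage account, container, and service-principal values below are placeholders, not details from this question:

# Hypothetical mount setup; <application-id>, <client-secret>, <tenant-id>,
# <container>, and <storage-account> are placeholders to fill in.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": "<client-secret>",
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/eco_dailyRain",
    extra_configs=configs,
)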
    

Just to test whether I can read from the mount:

spark.read.parquet('dbfs:/mnt/eco_dailyRain/Consignments.parquet')

confirms that the parquet file can be read.

Output:

Out[83]: DataFrame[CONSIGNMENT_PK: int, CERTIFICATE_NO: string, ACTOR_NAME: string, GENERATOR_FK: int, TRANSPORTER_FK: int, RECEIVER_FK: int, REC_POST_CODE: string, WASTEDESC: string, WASTE_FK: int, GEN_LICNUM: string, VOLUME: int, MEASURE: string, WASTE_TYPE: string, WASTE_ADD: string, CONTAMINENT1_FK: int, CONTAMINENT2_FK: int, CONTAMINENT3_FK: int, CONTAMINENT4_FK: int, TREATMENT_FK: int, ANZSICODE_FK: int, VEH1_REGNO: string, VEH1_LICNO: string, VEH2_REGNO: string, VEH2_LICNO: string, GEN_SIGNEE: string, GEN_DATE: timestamp, TRANS_SIGNEE: string, TRANS_DATE: timestamp, REC_SIGNEE: string, REC_DATE: timestamp, DATECREATED: timestamp, DISCREPANCY: string, APPROVAL_NUMBER: string, TR_TYPE: string, REC_WASTE_FK: int, REC_WASTE_TYPE: string, REC_VOLUME: int, REC_MEASURE: string, DATE_RECEIVED: timestamp, DATE_SCANNED: timestamp, HAS_IMAGE: string, LASTMODIFIED: timestamp]
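
Note that spark.read.parquet only needs to resolve the schema; to confirm the mount end to end by forcing an actual scan of the data, something like the following would work (a small sketch, not from the original post):

df = spark.read.parquet("dbfs:/mnt/eco_dailyRain/Consignments.parquet")
print(df.count())  # forces Spark to actually read the files, not just the footer metadata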

But trying to read a netCDF4 file fails with No such file or directory.

Code:

import datetime as dt  # Python standard library datetime  module
import numpy as np
from netCDF4 import Dataset  # http://code.google.com/p/netcdf4-python/
import matplotlib.pyplot as plt

rootgrp = Dataset("dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc","r", format="NETCDF4")

Error:

FileNotFoundError: [Errno 2] No such file or directory: b'dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc'

Any clues?

Solution

According to the API reference of the netCDF4 module, the path parameter of the Dataset class should be a filesystem path in Unix format. The path dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc is a DBFS URI that only the Spark APIs understand, which is why you get the error FileNotFoundError: [Errno 2] No such file or directory: b'dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc'.

The fix is to change the path dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc to the equivalent Unix path /dbfs/mnt/eco_dailyRain/2001.daily_rain.nc (Databricks exposes DBFS on the driver's local filesystem under /dbfs), as in the code below.

rootgrp = Dataset("/dbfs/mnt/eco_dailyRain/2001.daily_rain.nc","r", format="NETCDF4")

You can check that the Unix path exists with the shell command below.

%sh
ls /dbfs/mnt/eco_dailyRain

Of course, you can also list your netCDF4 data files via dbutils.fs.ls('/mnt/eco_dailyRain'), since the storage is already mounted.
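
For instance, to collect only the .nc paths from Python (a small sketch using the same mount point):

# dbutils.fs.ls returns FileInfo objects; keep only the netCDF files.
nc_files = [f.path for f in dbutils.fs.ls("/mnt/eco_dailyRain") if f.name.endswith(".nc")]
print(nc_files)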
