Reading CSV files from Microsoft Azure using R


Question


I have recently started working with Databricks and Azure.

I have Microsoft Azure Storage Explorer. I ran a jar program on Databricks which outputs many CSV files in Azure Storage Explorer under the path

..../myfolder/subfolder/output/old/p/ 

What I usually do is go to the folder p, download all the CSV files by right-clicking the p folder and clicking download to my local drive, and then use these CSV files in R for analysis.

My issue is that sometimes my runs can generate more than 10000 CSV files, and downloading them to the local drive takes a lot of time.

I wondered if there is a tutorial or R package that would help me read the CSV files from the path above without downloading them. For example, is there any way I can set

..../myfolder/subfolder/output/old/p/  

as my working directory and process all the files the same way I do now?

EDIT: the full URL to the path looks something like this:

https://temp.blob.core.windows.net/myfolder/subfolder/output/old/p/

Solution

According to the official document CSV Files of Azure Databricks, you can read a CSV file directly in R in an Azure Databricks notebook, as the R example in its section Read CSV files notebook example shows.
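For illustration, a minimal sketch of what such a notebook cell might look like (this example is mine, not taken from the document), assuming the storage container has been mounted at a hypothetical DBFS path:

    # A minimal sketch, assuming the container is mounted at a hypothetical
    # DBFS path /mnt/myfolder; header/inferSchema are passed as CSV options.
    library(SparkR)
    df <- read.df("/mnt/myfolder/subfolder/output/old/p/",
                  source = "csv", header = "true", inferSchema = "true")
    head(df)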

Alternatively, I used the R package reticulate and the Python package azure-storage-blob to read a CSV file directly from a blob URL with a SAS token for Azure Blob Storage.

Here are my steps.

  1. I created an R notebook in my Azure Databricks workspace.
  2. Install the R package reticulate via install.packages("reticulate").

  3. Install the Python package azure-storage-blob with the code below (a note on SDK versions follows these steps).

    %sh
    pip install azure-storage-blob
    

  4. Run the Python script below to generate a container-level SAS token and use it to build a list of blob URLs with the SAS token appended (this uses the legacy SDK; see the version note after these steps).

    library(reticulate)
    py_run_string("
    from azure.storage.blob.baseblobservice import BaseBlobService
    from azure.storage.blob import BlobPermissions
    from datetime import datetime, timedelta
    
    account_name = '<your storage account name>'
    account_key = '<your storage account key>'
    container_name = '<your container name>'
    
    # Connect with the account key (legacy SDK: azure-storage-blob 2.x and earlier)
    blob_service = BaseBlobService(
        account_name=account_name,
        account_key=account_key
    )

    # Generate a container-level, read-only SAS token valid for one hour
    sas_token = blob_service.generate_container_shared_access_signature(
        container_name,
        permission=BlobPermissions.READ,
        expiry=datetime.utcnow() + timedelta(hours=1)
    )

    # List every blob under the prefix and append the SAS token to each URL
    blob_names = blob_service.list_blob_names(container_name, prefix='myfolder/')
    blob_urls_with_sas = ['https://'+account_name+'.blob.core.windows.net/'+container_name+'/'+blob_name+'?'+sas_token for blob_name in blob_names]
    ")
    blob_urls_with_sas <- py$blob_urls_with_sas
    

  5. Now I can read a CSV file in R from a blob URL with the SAS token in several ways, such as below (a sketch for reading all the files at once follows these steps).

    5.1. Using base R

    df <- read.csv(blob_urls_with_sas[[1]])

    5.2. Using the R package data.table

    install.packages("data.table")
    library(data.table)
    df <- fread(blob_urls_with_sas[[1]])
    

    5.3. Using the R package readr

    install.packages("readr")
    library(readr)
    df <- read_csv(blob_urls_with_sas[[1]])
    
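Since the question involves more than 10000 files rather than one, here is a small additional sketch of mine (assuming all the CSVs share the same columns) that reads every generated URL and stacks the results:

    # Assumption: every CSV has the same schema. fread() reads each blob URL
    # directly over HTTPS; rbindlist() stacks the pieces into one data.table.
    library(data.table)
    df_all <- rbindlist(lapply(blob_urls_with_sas, fread))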

Note: for the reticulate library, please refer to the RStudio article Calling Python from R.
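A version caveat, added here as a note: BaseBlobService in step 4 belongs to the legacy azure-storage-blob SDK (2.x and earlier) and was removed in the 12.x rewrite, so the pip install in step 3 may need to pin an old release, for example pip install azure-storage-blob==2.1.0. Alternatively, here is a rough sketch of mine for step 4 against the current 12.x SDK (placeholders as before):

    library(reticulate)
    py_run_string("
    # Sketch for azure-storage-blob 12.x; account, key and container names
    # are placeholders, as in the original step 4.
    from datetime import datetime, timedelta
    from azure.storage.blob import (ContainerClient, generate_container_sas,
                                    ContainerSasPermissions)

    account_name = '<your storage account name>'
    account_key = '<your storage account key>'
    container_name = '<your container name>'

    sas_token = generate_container_sas(
        account_name, container_name, account_key=account_key,
        permission=ContainerSasPermissions(read=True),
        expiry=datetime.utcnow() + timedelta(hours=1))

    container = ContainerClient(
        account_url='https://' + account_name + '.blob.core.windows.net',
        container_name=container_name, credential=account_key)

    blob_urls_with_sas = [
        'https://' + account_name + '.blob.core.windows.net/'
        + container_name + '/' + blob.name + '?' + sas_token
        for blob in container.list_blobs(name_starts_with='myfolder/')]
    ")
    blob_urls_with_sas <- py$blob_urls_with_sas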

Hope it helps.
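As a further alternative that the answer above does not cover: the CRAN package AzureStor can list and read blobs from a plain local R session, which may suit the original "read without downloading" workflow. A hedged sketch, assuming (as the question's URL suggests) that myfolder is the container name and that a SAS token is at hand:

    # Hedged sketch with the AzureStor package (not part of the original
    # answer); the endpoint URL and SAS token are placeholders.
    install.packages("AzureStor")
    library(AzureStor)
    endp <- blob_endpoint("https://temp.blob.core.windows.net",
                          sas = "<your sas token>")
    cont <- blob_container(endp, "myfolder")
    blobs <- list_blobs(cont, dir = "subfolder/output/old/p")
    df <- storage_read_csv(cont, blobs$name[1])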


