Databricks read Azure blob last modified date


Question

I have an Azure Blob storage mounted to my Databricks HDFS. Is there a way to get the last modified date of a blob in Databricks?

This is how I'm reading the blob content:

val df = spark.read
  .option("header", "false")
  .option("inferSchema", "false")
  .option("delimiter", ",")
  .csv("/mnt/test/*")

Recommended answer

Generally, there are two ways to read the last modified date of an Azure blob, as described below.

  1. Read it directly via the Azure Storage REST API or the Azure Storage SDK for Java. The Azure Blob Storage REST API has two operations, Get Blob and Get Blob Properties, that return the Last-Modified property in the response headers. You can call these APIs from Scala and parse the response headers to get it, or simply use the Azure Storage SDK for Java from Scala to do the same.

Here is my sample code in Java for getting the Last-Modified property of a blob.

import java.util.Date;

import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.StorageException;
import com.microsoft.azure.storage.blob.CloudBlob;
import com.microsoft.azure.storage.blob.CloudBlobClient;
import com.microsoft.azure.storage.blob.CloudBlobContainer;

// Connection string template; the account name and key are filled in below.
String StorageConnectionStringTemplate = "DefaultEndpointsProtocol=https;" +
        "AccountName=%s;" +
        "AccountKey=%s";
String accountName = "<your storage account name for HDInsight>";
String accountKey = "<your storage account key for HDInsight>";
String containerName = "<container name for HDFS>";
String blobName = "<blob name>";
String storageConnectionString = String.format(StorageConnectionStringTemplate, accountName, accountKey);
CloudStorageAccount storageAccount = CloudStorageAccount.parse(storageConnectionString);
CloudBlobClient client = storageAccount.createCloudBlobClient();
CloudBlobContainer container = client.getContainerReference(containerName);
// Fetches the blob reference together with its properties from the service.
CloudBlob blob = container.getBlobReferenceFromServer(blobName);
Date lastModifiedDate = blob.getProperties().getLastModified();

Because Hadoop Azure is based on Azure Storage SDK for Java 8.0.0 rather than the newer version 10.0, my sample code above differs from the official tutorial of Azure Blob Storage for Java.

If you want to get the Last-Modified property of a container, you can use the REST API Get Container Properties or the Java code Date lastModifiedDate = container.getProperties().getLastModified();.
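If you take the REST route from approach 1, the Last-Modified value comes back as an HTTP date string in the response header. The following is a minimal sketch of parsing such a string with the standard java.time API, assuming the value is in RFC 1123 format. The class name LastModifiedHeader and the sample date are hypothetical and used only for illustration; the HTTP call itself is not shown.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration; not part of any Azure SDK.
class LastModifiedHeader {

    // Parses an RFC 1123 date string (the format assumed here for the
    // Last-Modified response header) into an Instant.
    static Instant parse(String headerValue) {
        return ZonedDateTime
                .parse(headerValue, DateTimeFormatter.RFC_1123_DATE_TIME)
                .toInstant();
    }

    public static void main(String[] args) {
        // Sample value made up for this sketch, not taken from a real response.
        System.out.println(parse("Wed, 21 Oct 2015 07:28:00 GMT"));
    }
}
```

Converting to an Instant early makes it easy to compare timestamps across blobs without worrying about time zones.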

  2. Use the Hadoop Azure Java API for the wasb:// protocol.

import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileStatus;

Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(conf);
Path f = new Path("<blob path on HDFS>");
FileStatus fileStatus = hdfs.getFileStatus(f);
// The modification time lives on the FileStatus, not on the Path.
long lastModifiedTime = fileStatus.getModificationTime();
Date lastModifiedDate = new Date(lastModifiedTime);
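FileStatus.getModificationTime() returns milliseconds since the Unix epoch in UTC. If you would rather have a time-zone-independent string than a java.util.Date, the sketch below converts such a value with java.time. The class name ModificationTimes and the sample value are hypothetical; in practice the long would come from the FileStatus above.

```java
import java.time.Instant;

// Hypothetical helper for illustration only.
class ModificationTimes {

    // Converts epoch milliseconds (as returned by getModificationTime)
    // into an ISO-8601 UTC timestamp string.
    static String toIsoUtc(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }

    public static void main(String[] args) {
        // Sample value made up for this sketch.
        System.out.println(toIsoUtc(1445412480000L));
    }
}
```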
