Access Data Lake from Azure Data Factory V2 using an on-demand HDInsight cluster


Problem description

I am trying to execute a Spark job from an on-demand HDInsight cluster using Azure Data Factory.

The documentation clearly indicates that ADF (v2) does not support a Data Lake linked service for on-demand HDInsight clusters, and that one has to copy the data onto blob storage with a copy activity and then execute the job. But this workaround seems hugely resource-expensive in the case of a billion files on a Data Lake. Is there any efficient way to access Data Lake files, either from the Python script that executes the Spark jobs, or any other way to access the files directly?
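For illustration, here is a minimal sketch of the Spark-job side of that workaround, assuming the ADF copy activity has already staged the Data Lake files into the cluster's attached blob container. The storage account, container, and path names below are hypothetical placeholders, not values from this post:

```python
# Minimal sketch, assuming the copy activity already staged the files
# into the blob container attached to the on-demand HDInsight cluster.
# "mystagingaccount", "staging", and "/staged/events/" are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ProcessStagedFiles").getOrCreate()

# An HDInsight cluster can read its attached blob storage directly
# via the wasb(s):// scheme, with credentials already in core-site.
df = spark.read.parquet(
    "wasbs://staging@mystagingaccount.blob.core.windows.net/staged/events/"
)

df.groupBy("eventType").count().show()
```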

P.S. Is there a possibility of doing a similar thing from v1? If yes, then how? "Create on-demand Hadoop clusters in HDInsight using Azure Data Factory" describes an on-demand Hadoop cluster that accesses blob storage, but I want an on-demand Spark cluster that accesses Data Lake.

P.P.S. Thanks in advance.

Recommended answer

Currently, we don't have support for the ADLS data store with HDI Spark clusters in ADF v2. We plan to add that in the coming months. Until then, you will have to continue using the workaround you mentioned in your post above. Sorry for the inconvenience.
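As a hedged sketch of what "accessing the files directly" could look like outside ADF's linked-service mechanism: if the cluster image ships the hadoop-azure-datalake driver, a Spark job can set the ADLS Gen1 OAuth properties on its own Hadoop configuration and read adl:// paths with a service principal. This is not an ADF feature and is not endorsed by the answer above; every ID and path below is a placeholder:

```python
# Hedged sketch: read ADLS Gen1 directly from a Spark job by setting the
# Hadoop ADLS connector's OAuth properties, bypassing ADF linked services.
# Assumes the hadoop-azure-datalake driver is present and a service
# principal has been granted access to the lake. All values are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DirectAdlsRead").getOrCreate()

hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
hconf.set("fs.adl.oauth2.client.id", "<service-principal-app-id>")
hconf.set("fs.adl.oauth2.credential", "<service-principal-key>")
hconf.set("fs.adl.oauth2.refresh.url",
          "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

# Read straight from the lake with the adl:// scheme.
df = spark.read.parquet("adl://mylake.azuredatalakestore.net/events/")
df.show(5)
```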
