Azure Databricks vs ADLA for processing


Problem description

Presently, I have all my data files in Azure Data Lake Store. I need to process these files, which are mostly in CSV format. The processing runs jobs on these files to extract various information, e.g. data for certain date ranges, events related to a scenario, or data combined from multiple tables/files. These jobs run every day as U-SQL jobs in Data Factory (v1 or v2), and the results are then sent to Power BI for visualization.

Using ADLA for all this processing, I feel it takes a lot of time and seems very expensive. I got a suggestion that I should use Azure Databricks for the above processes. Could somebody help me understand the difference between the two, and whether it would be worthwhile to shift? Can I convert all my U-SQL jobs into the Databricks notebook format?

Answer

Disclaimer: I work for Databricks.

It is tough to give pros/cons or advice without knowing how much data you work with, what kind of data it is, or how long your processing times are. If you want to compare Azure Data Lake Analytics costs with Databricks, that can only be done accurately by speaking with a member of the sales team.

Keep in mind that ADLA is based on the YARN cluster manager (from Hadoop) and only runs U-SQL batch-processing workloads. A description from BlueGranite:

ADLA is focused on batch processing, which is great for many Big Data workloads. 
Some example uses for ADLA include, but are not limited to:

- Prepping large amounts of data for insertion into a Data Warehouse
- Processing scraped web data for science and analysis
- Churning through text, and quickly tokenizing to enable context and sentiment analysis
- Using image processing intelligence to quickly process unstructured image data
- Replacing long-running monthly batch processing with shorter running distributed processes

Databricks covers both batch and stream processing, and handles both ETL (data engineering) and data science (machine learning, deep learning) workloads. Generally, here is why companies use Databricks:

  • Faster, more reliable, better-scaling Apache Spark™. Databricks created a customized version of Apache Spark™ (the Databricks Runtime) with optimizations that allow processing up to 100x faster than vanilla Apache Spark™.
  • Removes infrastructure bottlenecks caused by setup time or cost. Databricks creates Apache Spark™ clusters with all the necessary components in a few minutes. Apache Spark™, Python, Scala, plus all the machine learning and deep learning libraries you need are set up without involving Ops/DevOps. Clusters can autoscale to use extra resources only when needed, and unused clusters auto-terminate after a set time to avoid incurring unnecessary costs.
  • A unified analytics platform for both data engineers and data scientists. Data engineering and data science teams often work completely independently, with miscommunication, little visibility into each other's code and work, and inefficiencies in the development pipeline (getting data ingested, cleaned, and ready for analysis). Databricks provides collaborative notebooks that support multiple languages (SQL, R, Python, Scala, etc.) so that the two groups can work together.
  • Removes complexity from streaming use cases. Databricks has a product called Delta that lets you keep the scale of a data lake without running into the reliability, performance, and data-inconsistency issues that often occur when processing large amounts of streaming, schema-less data while others are trying to read from it. Delta provides performance boosts on top of the Apache Spark™ runtime and allows operations such as upserts on data in the data lake (typically extremely difficult to do).
  • Enterprise security, support, and first-class Spark expertise. Encryption, access controls, and more, with third-party-validated security. 75% of the Apache Spark™ codebase is contributed by Databricks, so the level of knowledge and expertise available is better than you would get anywhere else. That expertise can help with optimizing queries, tuning your clusters, recommending how to set up your data pipelines, and so on.
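
On the question of converting U-SQL jobs to notebooks: the extract-and-join pattern the question describes maps naturally onto Databricks. The sketch below shows the core logic of such a daily job — filter CSV rows to a date range, then join against a second table — using only the Python standard library so it runs anywhere; the file contents, column names, and dates are all hypothetical, not taken from the question. In a real notebook the same shape would be expressed with the Spark DataFrame API (`spark.read.csv`, `filter`, `join`) so it scales across the cluster.

```python
# Sketch of the daily job described in the question, with the Python stdlib.
# The equivalent Spark DataFrame version in a Databricks notebook is roughly:
#   events = spark.read.csv(events_path, header=True)
#   result = (events.filter(events.date.between("2018-06-01", "2018-06-30"))
#                   .join(customers, "customer_id"))
import csv
import io

# Stand-ins for CSV files in Azure Data Lake Store (contents are made up).
events_csv = io.StringIO(
    "date,customer_id,amount\n"
    "2018-05-30,c1,10\n"
    "2018-06-02,c2,25\n"
    "2018-06-15,c1,40\n"
)
customers_csv = io.StringIO(
    "customer_id,region\n"
    "c1,EU\n"
    "c2,US\n"
)

# Filter events to June 2018 (ISO dates compare correctly as strings).
events = [r for r in csv.DictReader(events_csv)
          if "2018-06-01" <= r["date"] <= "2018-06-30"]

# Join against the customer table on customer_id.
regions = {r["customer_id"]: r["region"] for r in csv.DictReader(customers_csv)}
joined = [{**e, "region": regions[e["customer_id"]]} for e in events]

for row in joined:
    print(row["date"], row["customer_id"], row["amount"], row["region"])
```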

There are more reasons than those, but these are some of the most common. If you think it may help your situation, you should try out a trial on the website.
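
As a footnote on the Delta point above: an upsert is, at its core, a merge-by-key — update rows whose key already exists, insert the rest. In Delta this is expressed with the SQL `MERGE INTO` statement (shown in the comment); the plain-Python sketch below illustrates only the semantics, with made-up table contents and key names.

```python
# Merge-by-key ("upsert") semantics, as provided by Delta's MERGE INTO:
#   MERGE INTO target USING updates ON target.id = updates.id
#   WHEN MATCHED THEN UPDATE SET *
#   WHEN NOT MATCHED THEN INSERT *
# The same behavior over plain dicts keyed by id (hypothetical data):
target = {1: {"id": 1, "value": "old"}, 2: {"id": 2, "value": "keep"}}
updates = [{"id": 1, "value": "new"}, {"id": 3, "value": "inserted"}]

for row in updates:
    target[row["id"]] = row  # matched -> update; not matched -> insert

print(sorted(target))  # the table now holds keys 1, 2, and 3
```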
