Difference between Apache NiFi and StreamSets


Problem description


I am planning to do a class project and was going through a few technologies that can automate or set up the flow of data between systems, and found that there are a couple of them, i.e. Apache NiFi and StreamSets (to my knowledge). What I couldn't understand is the difference between them and the use cases where each can be used. I am new to this, and if anyone could explain it to me a bit it would be highly appreciated. Thanks

Solution

Suraj,

Great question.

My response is as a member of the open source Apache NiFi project management committee and as someone who is passionate about the dataflow management domain.

I've been involved in the NiFi project since it was started in 2006. My knowledge of Streamsets is relatively limited so I'll let them speak for it as they have.

The key thing to understand is that NiFi was built to do one really important thing really well, and that is 'Dataflow Management'. Its design is based on a concept called Flow Based Programming, which you may want to read about and reference for your project: https://en.wikipedia.org/wiki/Flow-based_programming
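To make the Flow Based Programming idea a little more concrete, here is a minimal, self-contained Java sketch. It is only an illustration of the general concept (independent processing steps that exchange data exclusively through bounded connections between them), not NiFi code, and all the names in it are made up for the example:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A minimal illustration of the Flow Based Programming idea:
// independent "processors" that communicate only through bounded
// connections (queues), never by calling each other directly.
public class FlowSketch {

    public static void main(String[] args) throws Exception {
        // The connection between two processors; its bound is also what
        // makes back-pressure possible in a real dataflow system.
        BlockingQueue<String> connection = new LinkedBlockingQueue<>(10);

        // "Generate" processor: produces data onto the connection.
        Thread generate = new Thread(() -> {
            try {
                for (int i = 0; i < 5; i++) {
                    connection.put("event-" + i); // blocks if the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // "Deliver" processor: consumes data from the connection.
        Thread deliver = new Thread(() -> {
            try {
                for (int i = 0; i < 5; i++) {
                    System.out.println("delivered " + connection.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        generate.start();
        deliver.start();
        generate.join();
        deliver.join();
    }
}
```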

There are already many systems which produce data such as sensors and others. There are many systems which focus on data processing like Apache Storm, Spark, Flink, and others. And finally there are many systems which store data like HDFS, relational databases, and so on. NiFi purely focuses on the task of connecting those systems and providing the user experience and core functions necessary to do that well.

What are some of the key functions and design choices made to make this effective?

1) Interactive command and control

The job of someone trying to connect systems is to be able to rapidly and efficiently interact with the constant streams of data they see. NiFi's UI allows you to do just that: as the data is flowing you can add functions to operate on it, fork off copies of data to try new approaches, adjust current settings, see recent and historical stats, view helpful in-line documentation, and more. Almost all other systems, by comparison, have a model that is design-and-deploy oriented, meaning you make a series of changes and then deploy them. That model is fine and can be intuitive, but for the dataflow management job it means you don't get the interactive, change-by-change feedback that is so vital to quickly building new flows or to safely and efficiently correcting or improving the handling of existing data streams.

2) Data Provenance

A unique capability of NiFi is its ability to generate fine-grained and powerful traceability details for where your data comes from, what is done to it, where it's sent, and when each of those things happens in the flow. This is essential to effective dataflow management for a number of reasons, but for someone in the early exploration phase and working on a project, the most important thing it gives you is awesome debugging flexibility. You can set up your flows, let things run, and then use provenance to actually prove that the flow did exactly what you wanted. If something didn't happen as you expected, you can fix the flow, replay the object, and repeat. Really helpful.
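As a rough illustration of why this kind of provenance is so useful for debugging, here is a conceptual Java sketch. It is not NiFi's actual provenance repository or API; it just shows the idea of every step appending an event to a trail you can inspect afterwards, with all names invented for the example:

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch of provenance-style tracking: every step that touches
// a piece of data appends an event, so afterwards you can prove exactly
// what happened to it and pick a point to replay from.
public class ProvenanceSketch {

    record Event(String type, String detail, Instant time) {}

    static class TrackedData {
        final String id;
        String content;
        final List<Event> provenance = new ArrayList<>();

        TrackedData(String id, String content) {
            this.id = id;
            this.content = content;
            provenance.add(new Event("RECEIVE", "entered the flow", Instant.now()));
        }

        void transform(String newContent, String what) {
            this.content = newContent;
            provenance.add(new Event("MODIFY", what, Instant.now()));
        }

        void send(String destination) {
            provenance.add(new Event("SEND", "sent to " + destination, Instant.now()));
        }
    }

    public static void main(String[] args) {
        TrackedData data = new TrackedData("obj-1", "raw bytes");
        data.transform("RAW BYTES", "upper-cased the content");
        data.send("downstream-system");

        // Inspect the trail to verify the flow did what was intended.
        for (Event e : data.provenance) {
            System.out.println(data.id + " " + e.type() + ": " + e.detail() + " at " + e.time());
        }
    }
}
```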

3) Purpose built data repositories

NiFi's out-of-the-box experience offers very powerful performance even on really modest hardware or virtual environments. This is because of the flowfile and content repository design, which gives us the high-performance but transactional semantics we want as data works its way through the flow. The flowfile repository is a simple write-ahead-log implementation, and the content repository provides an immutable, versioned content store. That in turn means we can 'copy' data by only ever adding a new pointer (not actually copying bytes), or we can transform data by simply reading from the original and writing out a new version. Again, very efficient. Couple that with the provenance capability I mentioned a moment ago and it provides a really powerful platform.

Another really key thing to understand here is that, in the business of connecting systems, you don't always get to dictate things like the size of the data involved. The NiFi API was built to honor that fact, and so our API lets processors do things like receive, transform, and send data without ever having to load the full objects into memory. These repositories also mean that in most flows the majority of processors never even touch the content at all. However, you can easily see from the NiFi UI precisely how many bytes are actually being read or written, so again you get really helpful information for establishing and observing your flows.

This design also means NiFi can support back-pressure and pressure-release naturally, and these are really critical features for a dataflow management system.
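Here is a toy Java sketch of the copy-by-pointer idea described above. It is purely illustrative (the real flowfile and content repositories are far more sophisticated, with write-ahead logging, shared content claims, and disk-backed storage), and every name in it is hypothetical:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of an immutable content store with copy-by-pointer:
// a "flowfile" only holds a reference (claim id) into the store, so
// copying never duplicates bytes and transforming writes a new claim.
public class ContentRepoSketch {

    static class ContentRepo {
        private final Map<Integer, byte[]> claims = new HashMap<>();
        private int nextId = 0;

        int write(byte[] content) {          // content is never mutated in place
            claims.put(nextId, content.clone());
            return nextId++;
        }

        byte[] read(int claimId) {
            return claims.get(claimId).clone();
        }
    }

    record FlowFile(String id, int claimId) {}

    public static void main(String[] args) {
        ContentRepo repo = new ContentRepo();

        int claim = repo.write("hello".getBytes());
        FlowFile original = new FlowFile("ff-1", claim);

        // "Copy": just a new pointer to the same claim, no bytes copied.
        FlowFile copy = new FlowFile("ff-2", original.claimId());

        // "Transform": read the original claim, write the result as a new claim.
        byte[] upper = new String(repo.read(original.claimId())).toUpperCase().getBytes();
        FlowFile transformed = new FlowFile("ff-3", repo.write(upper));

        for (FlowFile ff : List.of(original, copy, transformed)) {
            System.out.println(ff.id() + " -> claim " + ff.claimId()
                    + " = " + new String(repo.read(ff.claimId())));
        }
    }
}
```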

It was mentioned previously by the folks from the Streamsets company that NiFi is file oriented. I'm not really sure what the difference is between a file, a record, a tuple, an object, or a message in generic terms, but the reality is that when data is in the flow it is 'a thing that needs to be managed and delivered'. That is what NiFi does. Whether you have lots of really high-speed tiny things or you have large things, and whether they came from a live audio stream off the Internet or from a file sitting on your hard drive, it doesn't matter. Once it is in the flow, it is time to manage and deliver it. That is what NiFi does.

It was also mentioned by the Streamsets company that NiFi is schemaless. It is accurate that NiFi does not force conversion of data from whatever it is originally into some special NiFi format, nor do we have to reconvert it back to some format for follow-on delivery. It would be pretty unfortunate if we did that, because it would mean that even the most trivial of cases would have problematic performance implications, and luckily NiFi does not have that problem. Further, had we gone that route, it would mean that handling diverse datasets like media (images, video, audio, and more) would be difficult, but we're on the right track, and NiFi is used for things like that all the time.
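To illustrate the format-agnostic point, here is a small Java sketch of relaying arbitrary content from a source to a destination in fixed-size chunks, without parsing it, converting it, or loading the whole object into memory. Again, this is just an illustration of the idea rather than NiFi internals:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Illustration of format-agnostic, memory-bounded handling: bytes are
// streamed from source to destination in small chunks, so it works the
// same for JSON, images, audio, or anything else, of any size.
public class PassThroughSketch {

    static long relay(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read); // no parsing, no conversion
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] anyContent = "could be JSON, an image, or an audio frame".getBytes();
        ByteArrayOutputStream destination = new ByteArrayOutputStream();

        long bytes = relay(new ByteArrayInputStream(anyContent), destination);
        System.out.println("relayed " + bytes + " bytes untouched");
    }
}
```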

Finally, as you continue with your project, if you find there are things you'd like to see improved, or you'd like to contribute code, we'd love to have your help. From https://nifi.apache.org you can quickly find information on how to file tickets, submit patches, email the mailing list, and more.

Here are a couple of fun recent NiFi projects to check out:
https://www.linkedin.com/pulse/nifi-ocr-using-apache-read-childrens-books-jeremy-dyer
https://twitter.com/KayLerch/status/721455415456882689

Good luck on the class project! If you have any questions, the users@nifi.apache.org mailing list would love to help.

Thanks Joe
