Event Notification of Data Availability in HDFS?

Question

What is the best approach to implementing a data availability notification system for Hadoop, such that whenever new data arrives it creates a notification that a job control framework can use to start the jobs that depend on that data? The main concern is that a job should be triggered as soon as the data becomes available, instead of each job polling the NameNode to check whether the data is there.

Answer

I would use a producer/consumer model in which the two sides communicate through a queue, for example Amazon SQS.

The producer maintains a list of watched directories and runs hadoop fs -test -e /path/to/watched/dir every x seconds (where x should be a configurable parameter). If the command exits with status 0 (in a shell, check $?), the directory exists and the producer sends a message to the queue. The content of the message could be just the name of the directory that just appeared, or a JSON object that carries additional metadata fields.
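The producer loop described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names (hdfs_path_exists, poll_once, run_producer), the "path" and "detected_at" message fields, and the choice to announce each directory only once are hypothetical and not part of the original answer. The queue is abstracted as a send callable so that the sketch does not depend on any particular queue service.

```python
import json
import subprocess
import time


def hdfs_path_exists(path):
    """Return True if `hadoop fs -test -e <path>` exits with status 0."""
    result = subprocess.run(["hadoop", "fs", "-test", "-e", path])
    return result.returncode == 0


def poll_once(watched, already_seen, exists, send):
    """Check each watched directory once and announce the ones that just appeared.

    watched      -- iterable of HDFS directory paths to check
    already_seen -- set of paths already announced (assumption: notify only once)
    exists       -- callable(path) -> bool, e.g. hdfs_path_exists
    send         -- callable(message_body: str), e.g. a wrapper around a queue client
    """
    newly_seen = []
    for path in watched:
        if path in already_seen:
            continue
        if exists(path):
            # Hypothetical message format: a JSON object with the path and a timestamp.
            send(json.dumps({"path": path, "detected_at": time.time()}))
            already_seen.add(path)
            newly_seen.append(path)
    return newly_seen


def run_producer(watched, send, interval_seconds, exists=hdfs_path_exists):
    """Poll the watched directories forever, sleeping x seconds between rounds."""
    seen = set()
    while True:
        poll_once(watched, seen, exists, send)
        time.sleep(interval_seconds)
```

With Amazon SQS, send would presumably wrap something like the boto3 client's send_message(QueueUrl=..., MessageBody=...) call; the queue URL is left to the reader's setup.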

On the other side, the consumer polls the queue every y seconds (where y should also be a parameter), and as soon as a message about new data arrives it starts the job on that directory.
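A consumer along these lines might look like the sketch below. It assumes, hypothetically, that each message body is a JSON object with a "path" field; the names drain_queue, run_consumer and receive_from_local_queue are illustrative only. A local queue.Queue stands in for the real queue so that the example runs on its own.

```python
import json
import queue
import time


def receive_from_local_queue(q):
    """Stand-in for a real queue client: return one message body, or None if empty."""
    try:
        return q.get_nowait()
    except queue.Empty:
        return None


def drain_queue(receive, start_job):
    """Start one job per message currently available; return how many were started.

    receive   -- callable() -> str or None (None means no message is waiting)
    start_job -- callable(path), e.g. something that submits a Hadoop job
    """
    started = 0
    while True:
        body = receive()
        if body is None:
            break
        message = json.loads(body)
        start_job(message["path"])  # assumed field name, see the note above
        started += 1
    return started


def run_consumer(receive, start_job, interval_seconds):
    """Check the queue forever, sleeping y seconds between checks."""
    while True:
        drain_queue(receive, start_job)
        time.sleep(interval_seconds)
```

With Amazon SQS, receive would presumably wrap the boto3 client's receive_message call and delete each message with delete_message once its job has been started. SQS also supports long polling through the WaitTimeSeconds parameter, which can reduce the delay between a message being sent and the job starting.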
