Streaming data and Hadoop? (not Hadoop Streaming)


Problem Description

I'd like to analyze a continuous stream of data (accessed over HTTP) using a MapReduce approach, so I've been looking into Apache Hadoop. Unfortunately, it appears that Hadoop expects to start a job with an input file of fixed size, rather than being able to hand off new data to consumers as it arrives. Is this actually the case, or am I missing something? Is there a different MapReduce tool that works with data being read in from an open socket? Scalability is an issue here, so I'd prefer to let the MapReducer handle the messy parallelization stuff.

I've played around with Cascading and was able to run a job on a static file accessed via HTTP, but this doesn't actually solve my problem. I could use curl as an intermediate step to dump the data somewhere on a Hadoop filesystem and write a watchdog to fire off a new job every time a new chunk of data is ready, but that's a dirty hack; there has to be some more elegant way to do this. Any ideas?
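To make the hack concrete, the intermediate dump step could look roughly like the sketch below: pull a chunk of data over HTTP and copy it into HDFS so a batch job can pick it up later. The source URL and the HDFS target path are placeholders, not part of the original question.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HttpToHdfsDump {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical source and destination; replace with real values.
        URL source = new URL("http://example.com/feed/chunk-0001");
        Path target = new Path("/incoming/chunk-0001");

        try (InputStream in = source.openStream();
             OutputStream out = fs.create(target)) {
            // Stream the HTTP response straight into an HDFS file.
            IOUtils.copyBytes(in, out, conf, false);
        }
    }
}
```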

Solution

The hack you describe is more or less the standard way to do things -- Hadoop is fundamentally a batch-oriented system (for one thing, if there is no end to the data, Reducers can't ever start, as they must start after the map phase is finished).

Rotate your logs; as you rotate them out, dump them into HDFS. Have a watchdog process (possibly a distributed one, coordinated using ZooKeeper) monitor the dumping grounds and start up new processing jobs. You will want to make sure the jobs run on inputs large enough to warrant the overhead.
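A rough sketch of that watchdog, assuming the dumps land in an HDFS directory such as /incoming. The paths, polling interval, and size threshold are illustrative, and the actual mapper/reducer for the analysis are left at the identity defaults to keep the sketch self-contained.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DumpDirWatchdog {
    // Only launch a job once enough data has piled up to justify the job-startup overhead.
    private static final long MIN_INPUT_BYTES = 256L * 1024 * 1024;

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path incoming = new Path("/incoming");   // hypothetical dumping ground

        while (true) {
            // Total up the bytes currently sitting in the dump directory.
            long pending = 0;
            for (FileStatus f : fs.listStatus(incoming)) {
                pending += f.getLen();
            }

            if (pending >= MIN_INPUT_BYTES) {
                Job job = Job.getInstance(conf, "stream-batch");
                job.setJarByClass(DumpDirWatchdog.class);
                // The real mapper/reducer classes for the analysis would be set here.
                FileInputFormat.addInputPath(job, incoming);
                FileOutputFormat.setOutputPath(job, new Path("/results/" + System.currentTimeMillis()));
                job.waitForCompletion(true);
                // Processed input would normally be moved aside afterwards so the
                // next pass doesn't re-run over the same files (omitted here).
            }

            Thread.sleep(60_000);   // poll once a minute
        }
    }
}
```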

HBase is a BigTable clone in the Hadoop ecosystem that may be interesting to you, as it allows for a continuous stream of inserts; you will still need to run analytical queries in batch mode, however.
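If HBase fits, the continuous-insert path might look something like this sketch using the HBase client API. The table name, column family, and row layout are made up for illustration; the analysis itself would still run as a separate batch job.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StreamToHBase {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) {

            // In a real setup this would loop over records read from the HTTP stream;
            // here a single hard-coded record stands in for one incoming event.
            Put put = new Put(Bytes.toBytes("event-" + System.currentTimeMillis()));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("raw event data"));
            table.put(put);
        }
    }
}
```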
