Streaming data and Hadoop? (not Hadoop Streaming)


Question

I'd like to analyze a continuous stream of data (accessed over HTTP) using a MapReduce approach, so I've been looking into Apache Hadoop. Unfortunately, it appears that Hadoop expects to start a job with an input file of fixed size, rather than being able to hand off new data to consumers as it arrives. Is this actually the case, or am I missing something? Is there a different MapReduce tool that works with data being read in from an open socket? Scalability is an issue here, so I'd prefer to let the MapReducer handle the messy parallelization stuff.

I've played around with Cascading and was able to run a job on a static file accessed via HTTP, but this doesn't actually solve my problem. I could use curl as an intermediate step to dump the data somewhere on a Hadoop filesystem and write a watchdog to fire off a new job every time a new chunk of data is ready, but that's a dirty hack; there has to be some more elegant way to do this. Any ideas?
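For concreteness, the dump step of that hack might look like the following sketch, which uses the Hadoop FileSystem API rather than curl. The stream URL, the target path, and the chunk-naming scheme are placeholder assumptions, not part of any actual setup:

```java
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HttpToHdfsDump {
    public static void main(String[] args) throws Exception {
        // Hypothetical source URL and target path -- substitute your own.
        URL source = new URL("http://example.com/stream");
        Path target = new Path("hdfs:///incoming/chunk-" + System.currentTimeMillis());

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(target.toUri(), conf);

        try (InputStream in = source.openStream();
             FSDataOutputStream out = fs.create(target)) {
            // Copy the HTTP stream into HDFS in 4 KB chunks.
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}
```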

Answer

The hack you describe is more or less the standard way to do things -- Hadoop is fundamentally a batch-oriented system (for one thing, if there is no end to the data, Reducers can't ever start, as they must start after the map phase is finished).

Rotate your logs; as you rotate them out, dump them into HDFS. Have a watchdog process (possibly a distributed one, coordinated using ZooKeeper) monitor the dumping grounds and start up new processing jobs. You will want to make sure the jobs run on inputs large enough to warrant the overhead.
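A minimal single-process sketch of such a watchdog follows (the distributed, ZooKeeper-coordinated version is omitted). The /incoming and /processed paths, the poll interval, and the 1 GB threshold are illustrative assumptions, and the Mapper/Reducer classes are left at Hadoop's identity defaults:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HdfsWatchdog {
    // Hypothetical thresholds: poll every minute, require ~1 GB before launching.
    private static final long POLL_MS = 60_000L;
    private static final long MIN_INPUT_BYTES = 1L << 30;

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path incoming = new Path("/incoming"); // where the rotated logs land
        FileSystem fs = FileSystem.get(conf);

        while (true) {
            long pending = 0;
            for (FileStatus f : fs.listStatus(incoming)) {
                pending += f.getLen();
            }
            // Only pay the job-startup overhead once enough data has accumulated.
            if (pending >= MIN_INPUT_BYTES) {
                Job job = Job.getInstance(conf, "stream-batch");
                job.setJarByClass(HdfsWatchdog.class);
                // Mapper/Reducer classes omitted; set them for your analysis.
                FileInputFormat.addInputPath(job, incoming);
                FileOutputFormat.setOutputPath(job,
                        new Path("/processed/" + System.currentTimeMillis()));
                job.waitForCompletion(true);
                // A real watchdog would move or delete the consumed inputs here.
            }
            Thread.sleep(POLL_MS);
        }
    }
}
```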

HBase is a BigTable clone in the Hadoop ecosystem that may be interesting to you, as it allows for a continuous stream of inserts; you will still need to run analytical queries in batch mode, however.
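A sketch of what a continuous insert looks like through the HBase client API; the "events" table and "d" column family are hypothetical names, and a real ingester would issue one Put per record as it arrives off the socket:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StreamIngest {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) {
            // Write one record; no batch boundary is needed, unlike a MapReduce job.
            Put put = new Put(Bytes.toBytes("row-" + System.currentTimeMillis()));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"),
                    Bytes.toBytes("...record bytes..."));
            table.put(put);
        }
    }
}
```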
