Streaming data and Hadoop? (not Hadoop Streaming)


Question

I'd like to analyze a continuous stream of data (accessed over HTTP) using a MapReduce approach, so I've been looking into Apache Hadoop. Unfortunately, it appears that Hadoop expects to start a job with an input file of fixed size, rather than being able to hand off new data to consumers as it arrives. Is this actually the case, or am I missing something? Is there a different MapReduce tool that works with data being read in from an open socket? Scalability is an issue here, so I'd prefer to let the MapReducer handle the messy parallelization stuff.

I've played around with Cascading and was able to run a job on a static file accessed via HTTP, but this doesn't actually solve my problem. I could use curl as an intermediate step to dump the data somewhere on a Hadoop filesystem and write a watchdog to fire off a new job every time a new chunk of data is ready, but that's a dirty hack; there has to be some more elegant way to do this. Any ideas?
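For concreteness, the dump step of that hack might look like the following sketch, which uses the Hadoop FileSystem API rather than curl. The stream URL, the target path, and the chunk-naming scheme are placeholder assumptions, not part of any actual setup:

```java
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HttpToHdfsDump {
    public static void main(String[] args) throws Exception {
        // Hypothetical source URL and target path -- substitute your own.
        URL source = new URL("http://example.com/stream");
        Path target = new Path("hdfs:///incoming/chunk-" + System.currentTimeMillis());

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(target.toUri(), conf);

        try (InputStream in = source.openStream();
             FSDataOutputStream out = fs.create(target)) {
            // Copy the HTTP stream into HDFS in 4 KB chunks.
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}
```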

Answer

The hack you describe is more or less the standard way to do things -- Hadoop is fundamentally a batch-oriented system (for one thing, if there is no end to the data, Reducers can't ever start, as they must start after the map phase is finished).

Rotate your logs; as you rotate them out, dump them into HDFS. Have a watchdog process (possibly a distributed one, coordinated using ZooKeeper) monitor the dumping grounds and start up new processing jobs. You will want to make sure the jobs run on inputs large enough to warrant the overhead.
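A minimal single-process sketch of such a watchdog follows (the distributed, ZooKeeper-coordinated version is omitted). The /incoming and /processed paths, the poll interval, and the 1 GB threshold are illustrative assumptions, and the Mapper/Reducer classes are left at Hadoop's identity defaults:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HdfsWatchdog {
    // Hypothetical thresholds: poll every minute, require ~1 GB before launching.
    private static final long POLL_MS = 60_000L;
    private static final long MIN_INPUT_BYTES = 1L << 30;

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path incoming = new Path("/incoming"); // where the rotated logs land
        FileSystem fs = FileSystem.get(conf);

        while (true) {
            long pending = 0;
            for (FileStatus f : fs.listStatus(incoming)) {
                pending += f.getLen();
            }
            // Only pay the job-startup overhead once enough data has accumulated.
            if (pending >= MIN_INPUT_BYTES) {
                Job job = Job.getInstance(conf, "stream-batch");
                job.setJarByClass(HdfsWatchdog.class);
                // Mapper/Reducer classes omitted; set them for your analysis.
                FileInputFormat.addInputPath(job, incoming);
                FileOutputFormat.setOutputPath(job,
                        new Path("/processed/" + System.currentTimeMillis()));
                job.waitForCompletion(true);
                // A real watchdog would move or delete the consumed inputs here.
            }
            Thread.sleep(POLL_MS);
        }
    }
}
```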

HBase is a BigTable clone in the Hadoop ecosystem that may be interesting to you, as it allows for a continuous stream of inserts; you will still need to run analytical queries in batch mode, however.
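A sketch of what a continuous insert looks like through the HBase client API; the "events" table and "d" column family are hypothetical names, and a real ingester would issue one Put per record as it arrives off the socket:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StreamIngest {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) {
            // Write one record; no batch boundary is needed, unlike a MapReduce job.
            Put put = new Put(Bytes.toBytes("row-" + System.currentTimeMillis()));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"),
                    Bytes.toBytes("...record bytes..."));
            table.put(put);
        }
    }
}
```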
