Apache Kafka + Spark Integration (REST API is needed?)


Problem Description

I have some fundamental questions and hope someone can clear them up.

So I want to use Apache Kafka and Apache Spark for my application. I have gone through numerous tutorials and have a basic idea of what they are and how they work.

Use case :

Data will be generated by mobile devices (multiple devices, let's say 1000) at an interval of 40 seconds, and I need to process that data and add values to a database, which in turn will be reflected in a dashboard.

What I wanted to do is use Apache Streams, make a POST request from Android itself, and then have that data processed by the Spark application, and that's it.

Issues:

  • Apache Spark

I am following this tutorial to get it up and running (I am using Java, not Scala). Link: https://www.santoshsrinivas.com/installing-apache-spark-on-ubuntu-16-04/

After everything is done, I execute spark-shell and it starts. I have also installed ZooKeeper and Kafka on my server and have started Kafka in the background, so that's not an issue.

When I open http://161.xxx.xxx.xxx:4040/jobs/ I get this page.

In all the tutorials I have gone through, there is a page like this: https://i.stack.imgur.com/gF1fN.png, but I don't get it. Is it that Spark is not properly installed?

Now, when I want to deploy a standalone jar to Spark (using this link: http://data-scientist-in-training.blogspot.in/2015/03/apache-spark-cluster-deployment-part-1.html), I am able to run it, i.e. with the command: spark-submit --class SimpleApp.SimpleApp --master spark://http://161.xxx.xxx.xxx:7077 --name "try" /opt/spark/bin/try-0.0.1-SNAPSHOT.jar, I get the output.

Do I need to submit the application every time I want to use it?

This is my program:

package SimpleApp;

/* SimpleApp.java */
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "/opt/spark/README.md"; // Should be some file on your system
    SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);
    //System.setProperty("hadoop.home.dir", "C:/winutil");
    sc.setLogLevel("ERROR"); // Don't want the INFO stuff
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    System.out.println("word count : "+logData.first());
    sc.stop();
  }
}

  • Now how do I integrate Kafka into it?

  • How do I configure the app in such a way that it gets executed every time Kafka receives a message?

  • Moreover, do I need to make a REST API through which I send the data to Kafka, i.e. the REST API would be used as a producer? Something like the Spark Java framework? http://sparkjava.com/

  • If yes, the bottleneck will again be at the REST API level, i.e. how many requests it can handle, because everywhere I read that Kafka has very high throughput.

  • Is the final structure going to be like SPARK JAVA -> KAFKA -> APACHE SPARK?

  • Lastly, how do I set up the development structure on my local machine? I have Kafka/Apache Spark installed, and I am using Eclipse.

Thanks

Solution

Well,

You are having some trouble understanding how Spark works with Kafka.

First, let's clarify a few things:

  1. Kafka is a stream-processing platform built for low latency and high throughput. It will allow you to store and read lots of data really fast.
  2. Spark has two types of processing, Spark Batch and Spark Streaming. What you are studying is batch; for your problem I suggest you look at Spark Streaming (a minimal comparison sketch follows this list).
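
To make the batch vs. streaming distinction concrete, here is a minimal, hypothetical Java sketch (not part of the original answer): the batch job reads a fixed file once and exits, while the streaming job runs continuously in micro-batches. The file path, host and port are illustrative placeholders only.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BatchVsStreaming {

  // Batch: process a fixed dataset once, print a result, and stop (what SimpleApp above does).
  static void runBatch() {
    SparkConf conf = new SparkConf().setAppName("BatchExample").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);
    long lines = sc.textFile("/opt/spark/README.md").count(); // fixed input, read once
    System.out.println("lines = " + lines);
    sc.stop();
  }

  // Streaming: a long-running job that keeps consuming new data in small micro-batches.
  static void runStreaming() throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("StreamingExample").setMaster("local[*]");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
    // Toy socket source for illustration; in the real pipeline the source would be Kafka.
    JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);
    lines.count().print(); // prints the number of records received in each 10-second batch
    jssc.start();
    jssc.awaitTermination(); // runs until explicitly stopped
  }

  public static void main(String[] args) throws InterruptedException {
    runStreaming(); // or runBatch();
  }
}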

What is Streaming?

Streaming is a way to transport and transform your data in real time or near real time. You do not need to create a process that has to be called every 10 minutes or every 10 seconds; you start the job once, and it keeps consuming from the source and publishing to the sink.

Kafka is a passive platform, so it can be either the source or the sink of a stream process.

In your case, what I suggest is:

  1. Create a streaming producer for your Kafka: you will read the logs of your mobile application on your web server, so you need to plug something into the web server to start consuming that data. What I suggest is Fluentd; it is a really strong streaming application, written in Ruby, but really easy to use. If you want something more robust and more focused on Big Data, I suggest Apache NiFi; it is harder to work with and not as simple, but you can create data-flow pipelines to move your information into your cluster. And something REALLY SIMPLE that will also solve your problem is Apache Flume.
  2. Start your Kafka; you can use Docker to run it. Kafka will hold your data for a period and will let you fetch it very fast and in large volumes when you need it. Please read the docs to understand how it works.
  3. Spark Streaming - it will not make sense to use Kafka if you don't have a stream process; your REST-based solution for producing data into Kafka is slow, and as a batch job it doesn't make sense. So if you are writing the data as a stream, you should analyse it as a stream too. I suggest you read about Spark Streaming here, and about how to integrate Spark with Kafka here (a minimal consumer sketch follows this list).
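
To make item 3 concrete, here is a minimal, hypothetical sketch of the Spark Streaming consumer in Java. It assumes the spark-streaming-kafka-0-10 integration artifact is on the classpath; the broker address, consumer group and the mobile-events topic are placeholder names, not something from the original question.

import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class KafkaStreamApp {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("KafkaStreamApp").setMaster("local[*]");
    // Micro-batch interval: how often Spark processes whatever arrived on the topic.
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "localhost:9092");      // placeholder broker address
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "mobile-metrics-consumers");     // placeholder consumer group
    kafkaParams.put("auto.offset.reset", "latest");

    Collection<String> topics = Arrays.asList("mobile-events");  // placeholder topic name

    JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
            jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

    // Extract the message payloads; this is where you would parse them and write to your database.
    JavaDStream<String> values = stream.map(ConsumerRecord::value);
    values.foreachRDD(rdd -> rdd.foreach(v -> System.out.println("received: " + v)));

    jssc.start();
    jssc.awaitTermination();  // long-running job: keeps consuming until stopped
  }
}

Unlike the batch SimpleApp above, you spark-submit this job once and it keeps running, processing every new message that arrives on the topic; that is also the answer to the "do I need to submit the application every time" question.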

So, as you asked:

Do I need a REST API? The answer is No.

The architecture will be like this:

Web Server -> Fluentd -> Apache Kafka -> Spark Streaming -> Output
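
For completeness, here is a minimal, hypothetical sketch of the publishing side in Java using the standard kafka-clients producer API; this is roughly the role Fluentd plays in the diagram above, and the broker address, topic and payload are placeholders.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class MobileEventProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");            // placeholder broker address
    props.put("key.serializer", StringSerializer.class.getName());
    props.put("value.serializer", StringSerializer.class.getName());

    // One record per device reading; keying by device id keeps each device's readings in order.
    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      String payload = "{\"deviceId\":\"device-42\",\"value\":17.3}";  // placeholder payload
      producer.send(new ProducerRecord<>("mobile-events", "device-42", payload));
      producer.flush();
    }
  }
}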

I hope this helps.
