Apache Kafka + Spark Integration (REST API is needed?)


Problem Description

I have some basic questions; I hope someone can clear them up.

So I want to use Apache Kafka and Apache Spark for my application. I have gone through many tutorials and got the basic idea of what they are and how they work.

Use case:

Data will be generated at 40-second intervals from mobile devices (multiple devices, say 1000 devices). I need to process that data and add the values to a database, which in turn will be reflected in a dashboard.

What I want to do is to use Apache streams and make a POST request from Android itself, and then that data will be processed by the Spark application.

Questions:

• Apache Spark

I am following this tutorial to set it up and run it. (I am using JAVA, not Scala.) Link: https://www.santoshsrinivas.com/installing-apache-spark-on-ubuntu-16-04/


After everything is done, I execute spark-shell and it starts. I have also installed ZooKeeper and Kafka on my server and started Kafka in the background, so that's not an issue.

When I open http://161.xxx.xxx.xxx:4040/jobs/ I get this page.

In all the tutorials I have gone through, there is a page like this: https://i.stack.imgur.com/gF1fN.png but I don't get it. Is Spark not installed properly?

Now, when I want to deploy a standalone jar to Spark (using this link: http://data-scientist-in-training.blogspot.in/2015/03/apache-spark-cluster-deployment-part-1.html ), I am able to run it, i.e. with the command spark-submit --class SimpleApp.SimpleApp --master spark://http://161.xxx.xxx.xxx:7077 --name "try" /opt/spark/bin/try-0.0.1-SNAPSHOT.jar I get the output.

Do I need to submit the application every time I want to use it?

This is my program:

    package SimpleApp;
    
    /* SimpleApp.java */
    import org.apache.spark.api.java.*;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.function.Function;
    
    public class SimpleApp {
      public static void main(String[] args) {
        String logFile = "/opt/spark/README.md"; // Should be some file on your system
        SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        //System.setProperty("hadoop.home.dir", "C:/winutil");
        sc.setLogLevel("ERROR"); // Don't want the INFO stuff
        JavaRDD<String> logData = sc.textFile(logFile).cache();
    
        long numAs = logData.filter(new Function<String, Boolean>() {
          public Boolean call(String s) { return s.contains("a"); }
        }).count();
    
        long numBs = logData.filter(new Function<String, Boolean>() {
          public Boolean call(String s) { return s.contains("b"); }
        }).count();
    
        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
        System.out.println("word count : "+logData.first());
        sc.stop();
      }
    }
    

• Now how do I integrate Kafka into it?

• How do I configure the app so that it gets executed every time Kafka receives a message?

• Moreover, do I need to make a REST API through which I send the data to Kafka, i.e. the REST API will be used as the producer? Something like the Spark Java framework? http://sparkjava.com/

• If yes, the bottleneck will again be at the REST API level, i.e. how many requests it can handle, because everywhere I read that Kafka has very high throughput.

• Is the final structure going to be like SPARK JAVA -> KAFKA -> APACHE SPARK?

• Lastly, how do I set up the development environment on my local machine? I have Kafka/Apache Spark installed, and I am using Eclipse.

Thanks

Solution

Well,

You are facing some problems in understanding how Spark works with Kafka.

First, let's understand a few things:

1. Kafka is a stream-processing platform for low latency and high throughput. This will allow you to store and read lots of data really fast.
2. Spark has two types of processing, Spark Batch and Spark Streaming. What you are studying is batch; for your problem I suggest you look at Spark Streaming.

What is Streaming?

Streaming is a way to transport and transform your data in real time or near real time. You do not need to create a process that you call every 10 minutes or every 10 seconds; you start the job once, and it keeps consuming the source and posting to the sink.
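As a rough sketch of that long-running model in Java (the socket source, port, and 10-second batch interval below are placeholders for illustration, not something from the question or answer), the job is submitted once and then keeps consuming and emitting until it is stopped:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class StreamingSketch {
      public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[*]");
        // Micro-batches every 10 seconds; the job runs until it is stopped.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Source: a plain TCP socket (e.g. started with `nc -lk 9999`), used here only
        // to show the long-running source -> transform -> sink shape.
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Transform: count the lines containing "a" in each micro-batch.
        JavaDStream<Long> counts = lines.filter(s -> s.contains("a")).count();

        // Sink: print to stdout; in a real job this would be a database or another topic.
        counts.print();

        jssc.start();            // start the continuous job once
        jssc.awaitTermination(); // keep running; no scheduled re-submission needed
      }
    }

In other words, unlike the batch SimpleApp above, you do not spark-submit this on a timer; the one running job handles every new piece of data as it arrives.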

Kafka is a passive platform, so Kafka can be a source or a sink of a stream process.
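To make the source/sink idea concrete, here is a minimal, hedged sketch of a producer that writes into Kafka using the plain Java client (the answer itself recommends Fluentd rather than a hand-written producer; the broker address, topic name, and payload below are assumptions for illustration only):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class DeviceEventProducer {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Here Kafka acts as the sink of this step; a streaming job can later read
        // the same topic, in which case Kafka acts as the source.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
          String deviceId = "device-42";                     // hypothetical key
          String payload = "{\"temp\": 21.5, \"ts\": 1234}"; // hypothetical value
          producer.send(new ProducerRecord<>("device-events", deviceId, payload));
        }
      }
    }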

In your case, what I suggest is:

1. Create a streaming producer for your Kafka; you will read the logs of your mobile application on your web server, so you need to plug something into your web server to start consuming the data. What I suggest is Fluentd, a really strong streaming application; it is written in Ruby but is really easy to use. If you want something more robust and more focused on Big Data, I suggest Apache NiFi; it is harder to work with, but you can create data-flow pipelines to transfer your information to your cluster. And something REALLY SIMPLE that will also solve your problem is Apache Flume.
2. Start your Kafka; you can use Docker to run it. Kafka will hold your data for a period and will allow you to fetch it really fast when you need it, even with a lot of information. Please read the docs to understand how it works.
3. Spark Streaming - It does not make sense to use Kafka if you don't have a stream process: your idea of using REST to produce the data into Kafka is slow, and processing it as a batch doesn't make sense either. So if you are writing as streaming, you should analyse as streaming too. I suggest you read up on Spark Streaming and on how to integrate Spark with Kafka; a sketch of the integration follows this list.
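As a rough illustration of point 3, a Spark Streaming job can read the topic directly through the spark-streaming-kafka-0-10 connector; the sketch below assumes that connector plus a local broker, and the topic name, group id, and 40-second batch interval are illustrative only:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    public class KafkaToSpark {
      public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("KafkaToSpark").setMaster("local[*]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(40));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");  // assumed broker
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "device-dashboard");         // assumed group id
        kafkaParams.put("auto.offset.reset", "latest");

        // Kafka is the source: subscribe to the topic the producers write to.
        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Collections.singletonList("device-events"), kafkaParams));

        // Per micro-batch processing; writing to your database would go here.
        stream.map(ConsumerRecord::value)
              .foreachRDD(rdd -> rdd.foreach(value -> System.out.println(value)));

        jssc.start();
        jssc.awaitTermination();
      }
    }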

So, as you asked:

Do I need a REST API? The answer is no.

The architecture will be like this:

Web Server -> Fluentd -> Apache Kafka -> Spark Streaming -> Output

I hope that helps.

