Can Apache Spark merge several similar lines into one line?


Question

I am totally new to Apache Spark, so I am sorry if my question seems naive, but I could not find a clear answer on the internet.

Here is the context of my problem: I want to retrieve JSON input data from an Apache Kafka server. The format is as follows:

{"deviceName":"device1", "counter":125}
{"deviceName":"device1", "counter":125}
{"deviceName":"device2", "counter":88}
{"deviceName":"device1", "counter":125}
{"deviceName":"device2", "counter":88}
{"deviceName":"device1", "counter":125}
{"deviceName":"device3", "counter":999}
{"deviceName":"device3", "counter":999}

With Spark or Spark Streaming, I want to process this data and produce output in the following format:

{"deviceName":"device1", "counter":125, "nbOfTimes":4}
{"deviceName":"device2", "counter":88, "nbOfTimes":2}
{"deviceName":"device3", "counter":999, "nbOfTimes":2}

So, I would like to know whether what I am trying to do is possible with Spark. If so, could you give me some guidance? I would be very thankful.

Answer

It can be done with both Spark and Spark Streaming. Let's first consider the batch case, with a JSON file containing your data.

val df = sqlContext.read.format("json").load("text.json")
// df: org.apache.spark.sql.DataFrame = [counter: bigint, deviceName: string]      

df.show
// +-------+----------+
// |counter|deviceName|
// +-------+----------+
// |    125|   device1|
// |    125|   device1|
// |     88|   device2|
// |    125|   device1|
// |     88|   device2|
// |    125|   device1|
// |    999|   device3|
// |    999|   device3|
// +-------+----------+

df.groupBy("deviceName","counter").count.toDF("deviceName","counter","nbOfTimes").show
// +----------+-------+---------+                                                  
// |deviceName|counter|nbOfTimes|
// +----------+-------+---------+
// |   device1|    125|        4|
// |   device2|     88|        2|
// |   device3|    999|        2|
// +----------+-------+---------+
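The `groupBy`/`count` above is the same "count occurrences of each distinct row" idea you can try on plain Scala collections, which may help build intuition without a Spark cluster. A minimal sketch (no Spark required; the data is the sample from the question):

```scala
// Count how many times each (deviceName, counter) pair occurs,
// mirroring df.groupBy("deviceName", "counter").count
val records = Seq(
  ("device1", 125), ("device1", 125), ("device2", 88),
  ("device1", 125), ("device2", 88), ("device1", 125),
  ("device3", 999), ("device3", 999)
)

// groupBy(identity) buckets identical pairs together; the bucket size
// is the number of occurrences (the "nbOfTimes" column in Spark)
val counts = records
  .groupBy(identity)
  .map { case ((name, counter), rows) => (name, counter, rows.size) }
// counts contains: ("device1", 125, 4), ("device2", 88, 2), ("device3", 999, 2)
```

Spark distributes exactly this kind of aggregation across partitions, but the per-key logic is the same.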

Obviously you can then write the result out in any format you want, but I think you get the main idea.
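For instance, writing the aggregated result back out as JSON is a one-liner, and the streaming side of the question can be sketched with Structured Streaming reading directly from Kafka. This is only a sketch: the output path, topic name (`"devices"`), broker address, and console sink are all illustrative assumptions, not values from the question.

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

// Batch: write the aggregated DataFrame out as JSON (path is illustrative)
df.groupBy("deviceName", "counter")
  .count()
  .toDF("deviceName", "counter", "nbOfTimes")
  .write.format("json").save("output/")

// Streaming sketch (Structured Streaming, Spark 2.x+), assuming a local
// Kafka broker and a topic called "devices"
val schema = new StructType()
  .add("deviceName", "string")
  .add("counter", "long")

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "devices")
  .load()

val counts = stream
  // Kafka delivers the payload as bytes; parse it with the schema above
  .select(from_json(col("value").cast("string"), schema).as("data"))
  .select("data.*")
  .groupBy("deviceName", "counter")
  .count()
  .withColumnRenamed("count", "nbOfTimes")

// "complete" mode re-emits the full, continuously updated counts table
counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
```

Note that in streaming mode the counts keep growing as new records arrive; if you want per-window totals instead, you would add a watermark and a time window to the `groupBy`.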

