How to save/insert each DStream into a permanent table


Problem description

I've been facing a problem with Spark Streaming concerning the insertion of an output DStream into a permanent SQL table. I'd like to insert every output DStream (coming from a single batch that Spark processes) into a unique table. I've been using Python with Spark version 1.6.2.

At this part of my code I have a DStream made of one or more RDDs that I'd like to permanently insert/store into a SQL table without losing any result for each processed batch.

rr = feature_and_label.join(result_zipped)\
                      .map(lambda x: (x[1][0][0], x[1][1]))

Each DStream element here is represented as a tuple, for instance (4.0, 0). I can't use Spark SQL because of the way Spark treats the 'table', that is, like a temporary table, therefore losing the results at every batch.

Here is an example of the output:

(0.0, 2)

(4.0, 0)

(4.0, 0)

...

As shown above, each batch is made of only one DStream. As I said before, I'd like to permanently store these results into a table saved somewhere, and possibly query it at a later time. So my question is: is there a way to do it?
I'd appreciate it if somebody could help me out with this, but especially tell me whether it is possible or not. Thank you.

Recommended answer

Vanilla Spark does not provide a way to persist data unless you've downloaded the version packaged with HDFS (although they appear to be playing with the idea in Spark 2.0). One way to store the results in a permanent table and query them later is to use one of the various databases in the Spark database ecosystem. There are pros and cons to each, and your use case matters. I'll provide something close to a master list. These are segmented by:

  • HDFS
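As a concrete illustration of the HDFS route, the usual pattern is to attach a function to the DStream with `foreachRDD` and append each micro-batch to a Parquet table, so earlier batches are never overwritten. This is a minimal sketch, not the answerer's own code: the output path `/tmp/stream_results`, the column names, and the helper function names are hypothetical; `foreachRDD`, `SQLContext.getOrCreate`, and `DataFrame.write.mode("append")` are real Spark 1.6 APIs.

```python
def tuple_to_row(t):
    """Map one output tuple like (4.0, 0) to named column values."""
    return {"label": float(t[0]), "count": int(t[1])}

def save_batch(time, rdd):
    """Append one batch's RDD to a permanent Parquet table (path is hypothetical)."""
    if rdd.isEmpty():
        return  # nothing to persist for this batch interval
    # Lazy import so the helpers above stay usable without a Spark install.
    from pyspark.sql import SQLContext, Row
    sql_ctx = SQLContext.getOrCreate(rdd.context)
    df = sql_ctx.createDataFrame(rdd.map(lambda t: Row(**tuple_to_row(t))))
    # mode("append") keeps results from every processed batch instead of replacing them.
    df.write.mode("append").parquet("/tmp/stream_results")

# Attach to the DStream from the question:
# rr.foreachRDD(save_batch)
```

Stored this way, the results can be queried at any later time by reading the path back, e.g. `sqlContext.read.parquet("/tmp/stream_results")`, which avoids the temporary-table behaviour described in the question.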
