Cache and Query a Dataset in Parallel Using Spark


Problem Description

I have a requirement where I want to cache a dataset and then compute some metrics by firing "N" queries in parallel over that dataset. All of these queries compute similar metrics; only the filters change. I want to run the queries in parallel because response time is crucial, and the dataset I want to cache will always be less than 1 GB in size.

I know how to cache a dataset in Spark and then query it subsequently, but if I have to run queries in parallel over the same dataset, how can I achieve that? Introducing Alluxio is one way, but is there any other way to achieve the same within the Spark world?

For example, with Java I can cache the data in memory and then achieve this using multithreading, but how do I do it in Spark?

Recommended Answer

It is quite simple to fire parallel queries from Spark's driver code using Scala's parallel collections. Here is a minimal example of what this could look like:

import org.apache.spark.sql.DataFrame
import spark.implicits._  // for toDF; `spark` is the active SparkSession

// cache the source dataset so repeated queries hit memory
val dfSrc = Seq(("Raphael", 34)).toDF("name", "age").cache()

// define your queries; instead of returning a DataFrame you could also write to a table etc.
val query1: DataFrame => DataFrame = (df: DataFrame) => df.select("name")
val query2: DataFrame => DataFrame = (df: DataFrame) => df.select("age")

// fire the queries in parallel
import scala.collection.parallel.ParSeq
ParSeq(query1, query2).foreach(query => query(dfSrc).show())
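One thing to keep in mind: cache() is lazy, so the dataset is only materialized the first time an action touches it. If the first parallel queries should not pay that one-off cost, you can force materialization up front, for example with dfSrc.count(), before firing the queries.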

To collect the query IDs and results in a map, you can do the following:

val resultMap = ParSeq(
  (1, query1),
  (2, query2)
).map { case (queryId, query) => (queryId, query(dfSrc)) }.toMap
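Since Scala 2.13 moved parallel collections into the separate scala-parallel-collections module, here is a minimal alternative sketch (not part of the original answer) that achieves the same fan-out from the driver with plain Futures. It assumes the same dfSrc, query1, and query2 as above; the count() call is just a stand-in action for whatever metric each query actually computes:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// run each query in its own Future; Spark schedules the resulting jobs concurrently
val futures = Seq((1, query1), (2, query2)).map { case (queryId, query) =>
  Future((queryId, query(dfSrc).count()))  // count() stands in for the real metric action
}

// block the driver until every query has finished (choose a timeout that fits your SLA)
val resultsById = Await.result(Future.sequence(futures), 10.minutes).toMap

Each Future triggers its own Spark job, and Spark's scheduler decides how the jobs share the cluster; since the dataset is cached and under 1 GB, the jobs mostly read from memory.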

