How to Test Spark RDD


Question

I am not sure whether we can test RDDs in Spark.

I came across an article which says that mocking an RDD is not a good idea. Is there any other way, or any best practice, for testing RDDs?

Answer

Thank you for putting this outstanding question out there. For some reason, when it comes to Spark, everyone gets so caught up in the analytics that they forget about the great software engineering practices that have emerged over the last 15 years or so. This is why we make it a point to discuss testing and continuous integration (among other things like DevOps) in our course.

A Quick Aside on Terminology

Before I go on, I have to express a minor disagreement with the KnolX presentation @himanshuIIITian cites. A true unit test means you have complete control over every component in the test. There can be no interaction with databases, REST calls, file systems, or even the system clock; everything has to be "doubled" (e.g. mocked, stubbed, etc.), as Gerard Meszaros puts it in xUnit Test Patterns. I know this seems like semantics, but it really matters. Failing to understand this is one major reason why you see intermittent test failures in continuous integration.

We Can Still Unit Test

So given this understanding, unit testing an RDD is impossible. However, there is still a place for unit testing when developing analytics.

(Note: I will use Scala in my examples, but the concepts transcend languages and frameworks.)

Consider a simple operation:

rdd.map(foo).map(bar)

Here foo and bar are simple functions. They can be unit tested in the normal way, and they should be, with as many corner cases as you can muster. After all, why should they care where their inputs come from, whether it is a test fixture or an RDD?
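
As a hedged illustration (not from the original answer), here is roughly what ordinary unit tests for such functions might look like with ScalaTest; the foo and bar definitions are hypothetical stand-ins, since the post does not define them:

import org.scalatest.{Matchers, WordSpec}

// Hypothetical transformation functions, for illustration only.
object Transformations {
  def foo(s: String): String = s.trim
  def bar(s: String): Int = s.length
}

class TransformationsSpec extends WordSpec with Matchers {
  import Transformations._

  "foo" should {
    "strip surrounding whitespace" in {
      foo("  hello  ") shouldBe "hello"
    }
    "leave the empty string alone" in {
      foo("") shouldBe ""
    }
  }

  "bar" should {
    "return the length of its input" in {
      bar("hello") shouldBe 5
    }
  }
}

No SparkContext, no cluster, no special setup: the functions are tested exactly like any other Scala code.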

Don't Forget the Spark Shell

This isn't testing per se, but in these early stages you also should be experimenting in the Spark shell to figure out your transformations and especially the consequences of your approach. For example, you can examine physical and logical query plans, partitioning strategy and preservation, and the state of your data with many different functions like toDebugString, explain, glom, show, printSchema, and so on. I will let you explore those.

You can also set your master to local[2] in the Spark shell and in your tests to identify any problems that may only arise once you start to distribute work.
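
As a rough, hedged sketch of that kind of exploration (the data and names here are made up for illustration), a session started with spark-shell --master local[2] might look like this:

// spark-shell --master local[2]
val rdd = sc.parallelize(1 to 100, numSlices = 4)

rdd.glom().collect()                   // see how elements land in each partition
println(rdd.map(_ * 2).toDebugString)  // inspect the RDD lineage

// In spark-shell the implicits needed for toDF and $ are already in scope.
val df = rdd.toDF("n")
df.printSchema()
df.show(5)
df.filter($"n" > 10).explain(true)     // logical and physical query plans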

Integration Testing with Spark

Now for the fun stuff.

In order to integration test Spark after you feel confident in the quality of your helper functions and RDD/DataFrame transformation logic, it is critical to do a few things (regardless of build tool and test framework):

  • Increase JVM memory.
  • Enable forking but disable parallel execution (a minimal sbt sketch of these settings follows this list).
  • Use your test framework to accumulate your Spark integration tests into suites, and initialize the SparkContext before all tests and stop it after all tests.
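
For example, with sbt the first two points might look roughly like the following build settings (a hedged sketch; the exact keys depend on your sbt version and project layout):

// build.sbt
fork in Test := true                                    // run tests in a forked JVM
javaOptions in Test ++= Seq("-Xms512M", "-Xmx2048M")    // give the forked JVM more memory
parallelExecution in Test := false                      // one SparkContext at a time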

There are several ways to do this last one. One is available from the spark-testing-base cited by both @Pushkr and the KnolX presentation linked by @himanshuIIITian.
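
As a hedged sketch of that approach, spark-testing-base provides a SharedSparkContext trait (in com.holdenkarau.spark.testing) that manages a suite-wide SparkContext exposed as sc; a test might look roughly like this:

import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite

class WordCountSpec extends FunSuite with SharedSparkContext {
  test("counts repeated elements") {
    // `sc` is created before the suite and stopped after it by the trait.
    val rdd = sc.parallelize(Seq("a", "b", "a"))
    assert(rdd.countByValue()("a") === 2)
  }
}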

The Loan Pattern

Another is to use the Loan Pattern.

For example (using ScalaTest):

// ScalaTest 2.x/3.0-style imports matching the WordSpec/Matchers traits used below.
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{Matchers, WordSpec}

class MySpec extends WordSpec with Matchers with SparkContextSetup {
  "My analytics" should {
    "calculate the right thing" in withSparkContext { (sparkContext) =>
      val data = Seq(...)
      val rdd = sparkContext.parallelize(data)
      val total = rdd.map(...).filter(...).map(...).reduce(_ + _)

      total shouldBe 1000
    }
  }
}

trait SparkContextSetup {
  // "Loans" a freshly created SparkContext to the test body and guarantees
  // it is stopped afterwards, even if the test throws.
  def withSparkContext(testMethod: (SparkContext) => Any): Unit = {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("Spark test")
    val sparkContext = new SparkContext(conf)
    try {
      testMethod(sparkContext)
    }
    finally sparkContext.stop()
  }
}

As you can see, the Loan Pattern makes use of higher-order functions to "loan" the SparkContext to the test and then to dispose of it after it's done.

Suffering-Oriented Programming (Thanks, Nathan)

It is totally a matter of preference, but I prefer to use the Loan Pattern and wire things up myself for as long as I can before bringing in another framework. Aside from just trying to stay lightweight, frameworks sometimes add a lot of "magic" that makes failing tests hard to reason about. So I take a Suffering-Oriented Programming approach, where I avoid adding a new framework until the pain of not having it is too much to bear. But again, this is up to you.

Now one place where spark-testing-base really shines is with the Hadoop-based helpers like HDFSClusterLike and YARNClusterLike. Mixing those traits in can really save you a lot of setup pain. Another place where it shines is with the Scalacheck-like properties and generators. But again, I would personally hold off on using it until my analytics and my tests reach that level of sophistication.
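
To give a flavor of that property-based style (a hedged sketch using plain ScalaCheck rather than spark-testing-base's Spark-aware generators; normalize is a hypothetical helper), a property over a pure function might look like this:

import org.scalacheck.Prop.forAll
import org.scalacheck.Properties

object NormalizeProps extends Properties("normalize") {
  // Hypothetical helper, for illustration only.
  def normalize(s: String): String = s.trim.toLowerCase

  // ScalaCheck generates many random strings and checks the property for each.
  property("is idempotent") = forAll { (s: String) =>
    normalize(normalize(s)) == normalize(s)
  }
}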

Integration Testing with Spark Streaming

Finally, I would just like to present a snippet of what a Spark Streaming integration test setup with in-memory values would look like:

import scala.collection.mutable
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream

val sparkContext: SparkContext = ...
val data: Seq[(String, String)] = Seq(("a", "1"), ("b", "2"), ("c", "3"))
val rdd: RDD[(String, String)] = sparkContext.parallelize(data)
// A mutable queue of RDDs is the in-memory source: each queued RDD becomes one micro-batch.
val strings: mutable.Queue[RDD[(String, String)]] = mutable.Queue.empty[RDD[(String, String)]]
val streamingContext = new StreamingContext(sparkContext, Seconds(1))
val dStream: InputDStream[(String, String)] = streamingContext.queueStream(strings)
strings += rdd

This is simpler than it looks. It really just turns a sequence of data into a queue to feed to the DStream. Most of it is really just boilerplate setup that works with the Spark APIs.
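
A hedged continuation of that snippet (timing details simplified, and the waiting strategy here is deliberately crude) could drain each micro-batch into an in-memory buffer so the test can make assertions on it:

import scala.collection.mutable

val results = mutable.ArrayBuffer.empty[(String, String)]
// In local mode this closure runs on the driver, so collecting into the buffer is safe.
dStream.foreachRDD(rdd => results ++= rdd.collect())

streamingContext.start()
Thread.sleep(2000)  // wait long enough for at least one 1-second batch to be processed
streamingContext.stop(stopSparkContext = false, stopGracefully = true)

assert(results.toSet == data.toSet)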

This might be my longest post ever, so I will leave it here. I hope others chime in with other ideas to help improve the quality of our analytics with the same agile software engineering practices that have improved all other application development.

And with apologies for the shameless plug, you can check out our course Analytics with Apache Spark, where we address a lot of these ideas and more. We hope to have an online version soon.
