Dropping the first and last row of an RDD with Spark


Question

I'm reading in a text file using spark with sc.textFile(fileLocation) and need to be able to quickly drop the first and last row (they could be a header or trailer). I've found good ways of returning the first and last row, but no good one for removing them. Is this possible?

Answer

One way of doing this would be to zipWithIndex, and then filter out the records with indices 0 and count - 1:

// We're going to perform multiple actions on this RDD,
// so it's usually better to cache it so we don't read the file twice
rdd.cache()

// Unfortunately, we have to count() to be able to identify the last index
val count = rdd.count()
val result = rdd.zipWithIndex().collect {
  case (v, index) if index != 0 && index != count - 1 => v
}
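
For example, on a small in-memory RDD (the sample rows below are hypothetical, just to show the shape of the result), the same approach works like this:

// Hypothetical sample: a header, three data rows, and a trailer
val rdd = sc.parallelize(Seq("HEADER", "row1", "row2", "row3", "TRAILER"))

rdd.cache()
val count = rdd.count()
val body = rdd.zipWithIndex().collect {
  case (v, index) if index != 0 && index != count - 1 => v
}

body.collect().foreach(println) // prints row1, row2, row3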

Do note that this might be rather costly in terms of performance (if you cache the RDD, you use up memory; if you don't, you read the RDD twice). So, if you have any way of identifying these records based on their contents (e.g. if you know that all records but these should contain a certain pattern), using filter would probably be faster.
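
For instance, if the header and trailer can be recognized by a known prefix (the "HEADER" and "TRAILER" tags here are an assumption for illustration, not something given in the question), a single filter pass avoids both the count() and the cache():

// Assumes the header/trailer rows carry a recognizable prefix;
// this is a sketch, not the original poster's actual file format
val body = rdd.filter { line =>
  !line.startsWith("HEADER") && !line.startsWith("TRAILER")
}

This only touches the data once and keeps the whole pipeline lazy, at the cost of relying on the records' contents rather than their positions.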
