Drop first row of Spark DataFrame


Problem description



I have a variable rawData of type DataFrame in my Spark/Scala code.

I would like to drop the first element, something like this:

rawData.drop(1)

However, the drop function is not available.

What's the simplest way of dropping the first element?

Solution

To answer the question, we first have to clarify what exactly the first element of a DataFrame is. We are not talking about an ordered collection placed on a single machine, but about a distributed collection with no particular order between partitions, so the answer is not obvious.

If you want to drop the first element from every partition, you can use:

df.mapPartitions(iterator => iterator.drop(1))
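To see what this per-partition drop does without a Spark cluster, the same iterator logic can be sketched in plain Scala. This is only an illustration, not Spark API: the `Vector` of iterators below is a stand-in for the rows of each partition.

```scala
// Each inner Iterator stands in for one partition's rows.
val partitions = Vector(Iterator("a", "b"), Iterator("c", "d", "e"))

// Drop the first element of every "partition", then collect the rest.
val dropped = partitions.map(iterator => iterator.drop(1)).flatMap(_.toList)
// dropped: Vector("b", "d", "e") -- the head of each partition is gone
```

Note that with this approach the number of rows removed equals the number of non-empty partitions, which is rarely what "drop the first row" means for the whole DataFrame.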

If you want to drop the first element from the first partition only, you can use:

val rdd = df.rdd.mapPartitionsWithIndex {
  case (index, iterator) => if (index == 0) iterator.drop(1) else iterator
}
sqlContext.createDataFrame(rdd, df.schema)
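The index-based variant can likewise be sketched in plain Scala, again simulating partitions with a `Vector` of iterators (illustration only, no Spark involved):

```scala
// Three simulated partitions of integer rows.
val partitions = Vector(
  Iterator(1, 2, 3), // partition 0
  Iterator(4, 5),    // partition 1
  Iterator(6, 7, 8)  // partition 2
)

// Drop the first element only from partition 0, keep the others intact.
val result = partitions.zipWithIndex.map {
  case (iterator, index) => if (index == 0) iterator.drop(1) else iterator
}.flatMap(_.toList)
// result: Vector(2, 3, 4, 5, 6, 7, 8)
```

Keep in mind that which row sits at the head of partition 0 depends on how the data was partitioned, so this only matches "the first row" when the DataFrame has a known, stable ordering (for example, a file read into a single partition).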

Neither solution is very graceful, and both seem like bad practice. It would be interesting to know the complete use case; maybe there is a better approach.

