How to efficiently extract multiple columns from a single string column RDD?

Problem description

I have a file with 20+ columns, of which I would like to extract a few. So far I have the following code. I'm sure there is a smarter way to do it, but I haven't been able to get one working. Any ideas?

mvnmdata is of type RDD[String]:

val strpcols = mvnmdata.map(x => x.split('|')).map(x => (x(0),x(1),x(5),x(6),x(7),x(8),x(9),x(10),x(11),x(12),x(13),x(14),x(15),x(16),x(17),x(18),x(19),x(20),x(21),x(22),x(23)))

Recommended answer

If you don't want to write x(i) repeatedly, you can collect the fields in a loop over the wanted indexes. Example 1:

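import scala.collection.mutable.ArrayBuffer

val strpcols = mvnmdata.map(x => x.split('|'))
  .map(x => {
    val xbuffer = new ArrayBuffer[String]()
    // list the column indexes to keep; the rest of the list is elided here
    for (i <- Array(0, 1, 5, 6 /* ... */)) {
      xbuffer.append(x(i))
    }
    xbuffer
  })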
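
The same selection can also be written without a mutable buffer, by mapping the index list over the split fields. The following is a minimal sketch on a plain Scala collection rather than an RDD, so that it runs without Spark. The index list `wanted`, the helper `selectColumns`, and the sample lines are assumptions made up for illustration, not part of the original answer.

```scala
// Hypothetical index list; substitute the columns you actually need.
val wanted: Array[Int] = Array(0, 1, 5, 6)

// Split one pipe-delimited line and keep only the wanted fields, in order.
// The limit of -1 keeps trailing empty fields, which split drops by default.
def selectColumns(line: String, indices: Array[Int]): Array[String] = {
  val fields = line.split("\\|", -1)
  indices.map(i => fields(i))
}

// Made-up sample data standing in for the contents of the RDD.
val sampleLines = Seq("a|b|c|d|e|f|g|h", "1|2|3|4|5|6|7|8")
val selected = sampleLines.map(line => selectColumns(line, wanted).toList)
```

On the RDD itself this would presumably become `mvnmdata.map(line => selectColumns(line, wanted))`.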

If you would rather define the index list by its start and end plus the numbers to exclude, see Example 2 below:

scala> (1 to 10).toSet
res8: scala.collection.immutable.Set[Int] = Set(5, 10, 1, 6, 9, 2, 7, 3, 8, 4)

scala> ((1 to 10).toSet -- Set(2,9)).toArray.sortBy(row=>row)
res9: Array[Int] = Array(1, 3, 4, 5, 6, 7, 8, 10)
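
Converting to a Set loses the ordering, which is why the result above has to be sorted again. As an alternative sketch (not from the original answer), filtering the range directly keeps the elements in order, so no sort is needed. The names `excluded` and `kept` are illustrative.

```scala
// A Range is already ordered, so filtering it preserves the order.
val excluded = Set(2, 9)

// A Set[Int] can be used as a predicate Int => Boolean.
val kept = (1 to 10).filterNot(excluded)

println(kept.mkString(", "))  // 1, 3, 4, 5, 6, 7, 8, 10
```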

The final code you want:

  import scala.collection.mutable.ArrayBuffer

  // define the function that builds the sorted list of indexes to keep
  def getSpecIndexes(start:Int, end:Int, removedValueSet:Set[Int]):Array[Int] = {
    ((start to end).toSet -- removedValueSet).toArray.sortBy(row=>row)
  }

  val strpcols = mvnmdata.map(x => x.split('|'))
    .map(x =>{
      val xbuffer = new ArrayBuffer[String]()
      // call the function; end must not exceed the last column index of the row,
      // otherwise x(i) throws ArrayIndexOutOfBoundsException
      for (i <- getSpecIndexes(0,100,Set(3,4,5,6))){
        xbuffer.append(x(i))
      }
      xbuffer
    })
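
One thing to watch with the code above is that `x(i)` throws when a row has fewer fields than the requested end index, and that the index list is rebuilt for every row. The sketch below shows one possible way around both, under the assumption that silently skipping missing columns is acceptable for your data. The names `indexes` and `pick` and the sample line are hypothetical, and it runs on a plain string rather than an RDD.

```scala
// Same idea as getSpecIndexes above, written with filterNot so no sort is needed.
def getSpecIndexes(start: Int, end: Int, removed: Set[Int]): Array[Int] =
  (start to end).filterNot(removed).toArray

// Computed once, outside the per-line function.
val indexes = getSpecIndexes(0, 100, Set(3, 4, 5, 6))

// Keep only the indexes that exist in this row (assumption: missing columns are skipped).
def pick(line: String): Array[String] = {
  val fields = line.split("\\|", -1)
  indexes.collect { case i if i < fields.length => fields(i) }
}

// Made-up sample line with nine fields.
val row = pick("a|b|c|d|e|f|g|h|i")
```

If a short row should instead be treated as an error, keep the plain `fields(i)` lookup and let the exception surface.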
