Scala RDD String manipulation

Problem description

I have an RDD called name.

scala> name
res6: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[24] at map at <console>:37

I can print its contents using name.foreach(println):

name5000005125651330
name5000005125651331
name5000005125651332
name5000005125651333

I want to create a new RDD that removes the "name" prefix from the beginning of each record and returns the remaining digits as a Long.

Desired result:

5000005125651330
5000005125651331
5000005125651332
5000005125651333

I tried the following:

val name_clean = name.filter(_ != "name")

But this returns:

name5000005125651330
name5000005125651331
name5000005125651332
name5000005125651333

Recommended answer

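Each entry in the RDD is a string. filter only keeps or discards whole entries; it never changes them. Since every entry is "name" followed by some digits, none of them equals "name" exactly, so the test _ != "name" is true for every entry and nothing is removed.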
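
To make the difference concrete, here is a minimal sketch that uses a plain Scala Seq as a stand-in for the RDD. That is an assumption made so the snippet runs without Spark; filter and map treat each element the same way on an RDD[String]. The sample values are taken from the question.

```scala
object FilterVsMap {
  // Local stand-in for the RDD[String] from the question (assumption).
  val name: Seq[String] = Seq(
    "name5000005125651330",
    "name5000005125651331"
  )

  // filter only decides whether to keep each whole element; it never edits it.
  // No element equals "name" exactly, so the predicate is true for all of them.
  val filtered: Seq[String] = name.filter(_ != "name")

  // map transforms each element: drop the 4-character prefix, then parse as Long.
  val cleaned: Seq[Long] = name.map(_.drop(4).toLong)

  def main(args: Array[String]): Unit = {
    println(filtered) // the same strings as the input
    println(cleaned)  // the numeric parts as Long values
  }
}
```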

What you need is map, which iterates over the RDD and returns a new value for each entry. That new value should be the string without its first 4 characters, converted to Long.

Putting it together, we get

name.map(_.drop(4).toLong)

If you aren't sure that the first four characters will always be "name", you might want to check that first. What you need then depends on what you want to do with rows that don't start with "name", but something like this would simply drop them:

name.filter(_.startsWith("name")).map(_.drop(4).toLong)
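
How to treat non-matching rows is a design choice. The sketch below shows two options, again on a local Seq standing in for the RDD so that it runs without Spark: dropping rows that don't parse, or collecting them for inspection. The helper name parseId and the two malformed sample rows are hypothetical and not part of the original question.

```scala
import scala.util.Try

object ParseNames {
  // Hypothetical helper: returns the numeric suffix if the row starts with
  // "name" and the remainder parses as a Long, otherwise None.
  def parseId(row: String): Option[Long] =
    if (row.startsWith("name")) Try(row.drop(4).toLong).toOption
    else None

  // Sample input with two made-up malformed rows.
  val rows: Seq[String] = Seq(
    "name5000005125651330",
    "id5000005125651331", // wrong prefix
    "nameABC"             // right prefix, non-numeric remainder
  )

  // Option 1: drop anything that does not parse.
  val ids: Seq[Long] = rows.flatMap(row => parseId(row))

  // Option 2: keep the rejected rows so they can be logged or inspected.
  val rejected: Seq[String] = rows.filter(row => parseId(row).isEmpty)

  def main(args: Array[String]): Unit = {
    println(ids)
    println(rejected)
  }
}
```

Unlike the filter-then-map version, wrapping the conversion in Try also guards against rows that have the right prefix but a non-numeric remainder, which would otherwise throw a NumberFormatException when toLong is evaluated.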
