Splitting strings in Apache Spark using Scala


Problem description

I have a dataset which contains lines in the following format (tab separated):

Title<\t>Text

Now, for every word in Text, I want to create a (Word, Title) pair. For instance:

ABC      Hello World

should give me:

(Hello, ABC)
(World, ABC)

Using Scala, I wrote the following:

val file = sc.textFile("s3n://file.txt")
val title = file.map(line => line.split("\t")(0))
val wordtitle = file.map(line => (line.split("\t")(1).split(" ").map(word => (word, line.split("\t")(0)))))
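// Note: the inner .map returns an Array[(String, String)] for each line,
// so wordtitle is an RDD[Array[(String, String)]], not an RDD of (word, title) pairs.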

But this gives me the following output:

[Lscala.Tuple2;@2204b589
[Lscala.Tuple2;@632a46d1
[Lscala.Tuple2;@6c8f7633
[Lscala.Tuple2;@3e9945f3
[Lscala.Tuple2;@40bf74a0
[Lscala.Tuple2;@5981d595
[Lscala.Tuple2;@5aed571b
[Lscala.Tuple2;@13f1dc40
[Lscala.Tuple2;@6bb2f7fa
[Lscala.Tuple2;@32b67553
[Lscala.Tuple2;@68d0b627
[Lscala.Tuple2;@8493285

How can I fix this?
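(Those lines are the default toString of a Java array: the map above produces an RDD[Array[(String, String)]], and printing an Array shows its JVM identity rather than its elements. A minimal illustration of the same effect outside Spark, with made-up sample values:

val pairs = Array(("Hello", "ABC"), ("World", "ABC"))
println(pairs)                // prints [Lscala.Tuple2;@2204b589 or similar: the default array toString
println(pairs.mkString(", ")) // prints (Hello,ABC), (World,ABC)
)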

Further information

What I want to achieve is to count the number of Words that occur in a Text for a particular Title.

The follow-up code I wrote is:

val wordcountperfile = file.map(line => (line.split("\t")(1).split(" ").flatMap(word => word), line.split("\t")(0))).map(word => (word, 1)).reduceByKey(_ + _)

But it does not work. Please feel free to give your inputs on this. Thanks!

Answer

So... in Spark you work with a distributed data structure called an RDD. RDDs provide functionality similar to Scala's collection types.

val fileRdd = sc.textFile("s3n://file.txt")
// RDD[ String ]

val splitRdd = fileRdd.map( line => line.split("\t") )
// RDD[ Array[ String ] ]

val yourRdd = splitRdd.flatMap( arr => {
  val title = arr( 0 )
  val text = arr( 1 )
  val words = text.split( " " )
  words.map( word => ( word, title ) )
} )
// RDD[ ( String, String ) ]

// Now, if you want to print this...
yourRdd.foreach( { case ( word, title ) => println( s"{ $word, $title }" ) } )
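// Note: on a cluster these printlns run on the executors, so the output
// appears in the executor logs rather than on the driver console.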

// If you want to count (this counts non-unique words per title):
val countRdd = yourRdd
  .groupBy( { case ( word, title ) => title } )  // group by title
  .map( { case ( title, iter ) => ( title, iter.size ) } ) // count for every title
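
If the goal is instead a count for each individual word under each title (as in the follow-up above), one possible sketch keys on the (word, title) pair and sums with reduceByKey (variable names are illustrative):

val wordTitleCounts = yourRdd
  .map( { case ( word, title ) => ( ( word, title ), 1 ) } ) // key by (word, title)
  .reduceByKey( _ + _ ) // sum the occurrences of every (word, title) pair
// RDD[ ( ( String, String ), Int ) ]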

