Text manipulation in Spark and Scala


Question

This is my data:

review/text: The product picture and part number match, but they together do not math the description.

review/text: A necessity for the Garmin. Used the adapter to power the unit on my motorcycle. Works like a charm.

review/text: This power supply did the job and got my computer back online in a hurry.

review/text: Not only did the supply work. it was easy to install, a lot quieter than the PowMax that fried.

review/text: This is an awesome power supply that was extremely easy to install. 

review/text: I had my doubts since best buy would end up charging me $60. at the time I bought my camera for the card and the cable.

review/text: Amazing... Installed the board, and that's it, no driver needed. Work great, no error messages.

And this is what I tried:

import org.apache.spark.{SparkContext, SparkConf}

object test12 {
  def filterfunc(s: String): Array[((String))] = {
    s.split( """\.""") 
      .map(_.split(" ")
      .filter(_.nonEmpty)
      .map(_.replaceAll( """\W""", "")
      .toLowerCase)
      .filter(_.nonEmpty)
      .flatMap(x=>x)
  }

  def main(args: Array[String]): Unit = {
    val conf1 = new SparkConf().setAppName("pre2").setMaster("local")
    val sc = new SparkContext(conf1)
    val rdd = sc.textFile("data/2012/2012.txt")
    val stopWords = sc.broadcast(List[String]("reviewtext", "a", "about", "above", "according", "accordingly", "across", "actually",...)

    var grouped_doc_words = rdd.flatMap({ (line) =>
      val words = line.map(filterfunc).filter(word_filter.value))
      words.map(w => {
        (line.hashCode(), w)
      })
    }).groupByKey()

  }
}

And this is the output I want to generate:

doc1: product picture number match together not math description. 
doc2: necessity garmin. adapter power unit my motorcycle. works like charm.
doc3: power supply job computer online hurry.
doc4: not supply work. easy install quieter powmax fried.
...

Some exceptions:
1. Negation words (not, n't, non, none) must not be removed.
2. All dot (.) symbols must be kept.

My code above doesn't work very well.

Recommended answer

Why not just do something like this:

This way you don't need any grouping or flatMapping.
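A rough illustration of that point, using plain Scala collections rather than RDDs so it runs without a cluster. The names `MapVsGroupBy`, `tokens`, `regrouped`, and `mapped` are hypothetical and only serve the comparison; the tokenizer is deliberately simplified.

```scala
object MapVsGroupBy {
  // Hypothetical, simplified tokenizer used only for this comparison.
  def tokens(line: String): List[String] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).toList

  // Shape of the question's code: explode each line into (docId, word)
  // pairs, then regroup the pairs by docId.
  def regrouped(lines: List[String]): Map[Int, List[String]] =
    lines
      .flatMap(line => tokens(line).map(w => (line.hashCode, w)))
      .groupBy(_._1)
      .map { case (id, pairs) => id -> pairs.map(_._2) }

  // Shape of the answer's code: each line maps straight to its token list.
  def mapped(lines: List[String]): List[List[String]] =
    lines.map(tokens)

  def main(args: Array[String]): Unit = {
    val lines = List("Works like a charm", "Easy to install")
    println(mapped(lines))
    println(regrouped(lines).values.toList)
  }
}
```

Both shapes end up with the same words per document, but the `map` version keeps exactly one output per input line. In Spark, `groupByKey` additionally triggers a shuffle, and two identical lines share the same `hashCode` and would be merged into one group.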

Edit:

I wrote this by hand and there were indeed some small bugs, but I hope the idea was clear. Here is the tested code:

import org.apache.spark.{SparkConf, SparkContext}

object test12 {
  // Lowercase the line, strip everything except letters, dots and whitespace,
  // turn each dot into its own token, then drop stop words.
  def processLine(s: String, stopWords: Set[String]): List[String] = {
    s.toLowerCase()
      .replaceAll("""[^a-z.\s]""", "")
      .replaceAll("""\.""", " .")
      .split("\\s+")
      .filter(w => w.nonEmpty && !stopWords.contains(w))
      .toList
  }

  def main(args: Array[String]): Unit = {
    val conf1 = new SparkConf().setAppName("pre2").setMaster("local")
    val sc = new SparkContext(conf1)
    val rdd = sc.parallelize(
      List(
        "The product picture and part number match, but they together do not math the description.",
        "A necessity for the Garmin. Used the adapter to power the unit on my motorcycle. Works like a charm.",
        "This power supply did the job and got my computer back online in a hurry."
      )
    )
    // Broadcast the stop-word set once so every task can read it.
    val stopWords = sc.broadcast(
      Set("reviewtext", "a", "about", "above",
        "according", "accordingly",
        "across", "actually", "..."))
    // One token list per input line; no grouping or flatMap needed.
    val grouped_doc_words = rdd.map(processLine(_, stopWords.value))
    grouped_doc_words.collect().foreach(p => println(p))
    sc.stop()
  }
}

This gives you the following result:

List(the, product, picture, and, part, number, match, but, they, together, do, not, math, the, description, .)
List(necessity, for, the, garmin, ., used, the, adapter, to, power, the, unit, on, my, motorcycle, ., works, like, charm, .)
List(this, power, supply, did, the, job, and, got, my, computer, back, online, in, hurry, .)
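The question also asks that negations (not, n't, non, none) survive stop-word removal. One possible way to handle that is to subtract the negations from the stop-word set and to keep apostrophes so that "n't" contractions stay recognizable. This is only a sketch: the stop-word sample below is an assumption (the real list is truncated in the question), and the names `NegationAwareFilter`, `negations`, `tokenize`, and `keep` are hypothetical.

```scala
object NegationAwareFilter {
  // Assumed sample of stop words; the question's real list is truncated.
  val baseStopWords: Set[String] =
    Set("a", "and", "but", "do", "the", "they", "not", "none")

  // Negations that must survive even if the stop-word list contains them.
  val negations: Set[String] = Set("not", "non", "none")

  val stopWords: Set[String] = baseStopWords -- negations

  // Like the answer's tokenizer, but apostrophes are kept so that
  // contractions such as "didn't" are not collapsed into "didnt".
  def tokenize(s: String): List[String] =
    s.toLowerCase
      .replaceAll("""[^a-z.'\s]""", "")
      .replaceAll("""\.""", " .")
      .split("\\s+")
      .filter(_.nonEmpty)
      .toList

  // A token is kept if it is an "n't" contraction or is not a stop word.
  def keep(token: String): Boolean =
    token.endsWith("n't") || !stopWords.contains(token)

  def processLine(s: String): List[String] = tokenize(s).filter(keep)

  def main(args: Array[String]): Unit =
    println(processLine("They didn't match, but do not math the description.").mkString(" "))
}
```

Inside a Spark job the same `processLine` could be passed to `rdd.map`, with the stop-word set broadcast as in the answer's code.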

Now, if you want a string instead of a list, just do:

grouped_doc_words.map(_.mkString(" "))
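The output requested in the question also carries a `docN:` label and has each dot attached to the word before it. A minimal sketch of that formatting step follows, written against plain Scala lists so it runs on its own; `FormatDocs`, `render`, and `label` are hypothetical names, and the token lists are made-up samples in the shape `processLine` returns.

```scala
object FormatDocs {
  // Join tokens with spaces, then glue each standalone "." back onto
  // the word that precedes it.
  def render(tokens: List[String]): String =
    tokens.mkString(" ").replaceAll(""" \.""", ".")

  // Prefix a 1-based document label, e.g. "doc1: ...".
  def label(tokens: List[String], index: Long): String =
    s"doc${index + 1}: ${render(tokens)}"

  def main(args: Array[String]): Unit = {
    val docs = List(
      List("power", "supply", "job", "computer", "online", "hurry", "."),
      List("not", "supply", "work", ".", "easy", "install", ".")
    )
    // On an RDD, zipWithIndex() produces comparable (element, index) pairs.
    docs.zipWithIndex.foreach { case (tokens, i) => println(label(tokens, i.toLong)) }
  }
}
```

With Spark, the equivalent would be along the lines of `grouped_doc_words.zipWithIndex().map { case (tokens, i) => FormatDocs.label(tokens, i) }`.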
