Scala RDD matching with similar wording


Question

So I have a list of verbs.

Assume:

verbs.txt

have, have, having, had
give, give, gave, given
take, take, took, taken

I split them into RDDs:

val verbs = sc.textFile("verbs.txt").map(_.split(", ")).collect()

Therefore,

verbs: Array[Array[String]] = Array(Array(have, have, having, had), Array(give, give, gave, given), Array(take, take, took, taken))

Assume:

val wordcount = sc.textFile("data.txt")

data.txt

have have have having having had had had had had give give give give give give give give give give gave gave given given given given take take took took took took took took taken taken

I've calculated the word count, so wordcount =

(have, 3)
(having, 2)
(had, 5)
(give, 10)
(gave, 2)
(given, 4)
(take, 2)
(took, 6)
(taken, 2)

I want to be able to merge the counts that belong to the same verb, for example: (have, 3), (having, 2), (had, 5) => (have, 10),

using the first value of each array as the base form of the verb. How can I do that?

Answer

Since you tagged your question with RDD, I assume your word count data is an RDD.

  // Read the text file
  val sc = spark.sparkContext
  val textFile: RDD[String] = sc.textFile("data.txt")

  // So you have this, as you said
  val verbs = Array(Array("have", "have", "having", "had"), Array("give", "give", "gave", "given"), Array("take", "take", "took", "taken"))

  val data = textFile
    .flatMap(_.split(" ")) // Split each line into words/tokens (tokenization); I used a space as the separator -- if your data is tab-separated, split on tabs instead
    .map(t => (t, 1)) // Emit a count of 1 per token (e.g. (have, 1))
    .reduceByKey(_ + _) // Count the occurrences of each token (e.g. (have, 5))

  val t = data.map(d => (verbs.find(v => v.contains(d._1)).map(_.head).getOrElse(d._1), d._2)) // Map each word to its base verb, e.g. (having, 5) => (have, 5); words not found in the verbs array are left as-is
    .reduceByKey(_ + _) // Sum all counts that share the same base verb: (have, 5), (have, 3) => (have, 8)

  t.take(10).foreach(println)
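The lookup-and-merge step can be checked locally without a Spark cluster; here is a minimal sketch using plain Scala collections on the sample counts from the question (the RDD version above applies the same logic):

```scala
object VerbMergeCheck {
  def main(args: Array[String]): Unit = {
    val verbs = Array(
      Array("have", "have", "having", "had"),
      Array("give", "give", "gave", "given"),
      Array("take", "take", "took", "taken"))

    val wordcount = Seq(
      ("have", 3), ("having", 2), ("had", 5),
      ("give", 10), ("gave", 2), ("given", 4),
      ("take", 2), ("took", 6), ("taken", 2))

    // Map each counted word to its base form (the first element of its
    // array), leave unknown words unchanged, then sum counts per base form.
    val merged = wordcount
      .map { case (w, n) => (verbs.find(_.contains(w)).map(_.head).getOrElse(w), n) }
      .groupBy(_._1)
      .map { case (base, pairs) => (base, pairs.map(_._2).sum) }

    println(merged.toSeq.sortBy(_._1)) // List((give,16), (have,10), (take,10))
  }
}
```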

Alternative (without collecting the verbs):

  // You don't have to collect the verbs if you keep them as an RDD
  val verbs2 = sc.parallelize(Array(Array("have", "have", "having", "had"), Array("give", "give", "gave", "given"), Array("take", "take", "took", "taken"))) // This is the state before collect
    .flatMap(v => v.map(v2 => (v2, v.head))) // Generate tuples of verb -> base verb (e.g. had -> have)
    .reduceByKey((k1, k2) => if (k1 == k2) k1 else k2) // The current verbs array generates (have -> have) twice; this eliminates the duplicate records

  val data2 = textFile
    .flatMap(_.split(" ")) // Split each line into words/tokens (tokenization); I used a space as the separator -- if your data is tab-separated, split on tabs instead
    .map(t => (t, 1)) // Emit a count of 1 per token (e.g. (have, 1))
    .reduceByKey(_ + _) // Count the occurrences of each token (e.g. (have, 5))

  val t2 = verbs2.join(data2) // Join the two RDDs by their keys: verb -> (base verb, verb count)
    .map(d => d._2) // Keep only the values: the key is the base verb, the value is the count of that verb
    .reduceByKey(_ + _) // Sum all counts that share the same base verb: (have, 5), (have, 3) => (have, 8)

  t2.take(10).foreach(println)
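The join-based variant can also be sketched locally with plain collections, using a `verb -> base verb` map in place of the joined RDD; this mirrors the `flatMap` + `join` steps above on the sample data:

```scala
object VerbJoinCheck {
  def main(args: Array[String]): Unit = {
    val verbs = Array(
      Array("have", "have", "having", "had"),
      Array("give", "give", "gave", "given"),
      Array("take", "take", "took", "taken"))

    // verb -> base verb lookup; toMap also removes the duplicate
    // (have -> have) pair that reduceByKey handles in the RDD version
    val verbToBase: Map[String, String] =
      verbs.flatMap(v => v.map(w => (w, v.head))).toMap

    val wordcount = Seq(
      ("have", 3), ("having", 2), ("had", 5),
      ("give", 10), ("gave", 2), ("given", 4),
      ("take", 2), ("took", 6), ("taken", 2))

    // "Join" each count with its base verb, then sum per base verb
    val merged = wordcount
      .flatMap { case (w, n) => verbToBase.get(w).map(base => (base, n)) }
      .groupBy(_._1)
      .map { case (base, pairs) => (base, pairs.map(_._2).sum) }

    println(merged.toSeq.sortBy(_._1)) // List((give,16), (have,10), (take,10))
  }
}
```

Note that, like the RDD `join`, this drops words that do not appear in the verbs array, whereas the `find`/`getOrElse` variant keeps them unchanged.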

Of course, this answer assumes you will always have your verbs array and that its first element is the base form. If you want something that works without a verbs array and converts any verb to its base form, that is actually an NLP (Natural Language Processing) task, and you need to use some kind of word normalization technique (as EmiCareOfCell44 indicated). You can also find implementations of such procedures in the Spark ML library.
