Spark filtering based on matches in two Arrays in RDDs


Question

I have an RDD of words, and another RDD of posts, each containing a string. If a word in that string matches a word in the word list, it should be removed from the string.

val wordList = sc.textFile("wordList.txt").map(x => x.split(',')).map(x => x(0))

Sample wordList:

res15: Array[String] = Array(basetting, choosinesses, concavenesses, crabbinesses, cupidinously, falliblenesses, fleecinesses, hackishes, immaterialnesses, impiousnesses)

Then I have the other RDD:

val filterWord = posts.map(x => (x._1, x._2.split(" ").filter(x => x != (wordList)))

Sample filterWord:

res16: Array[(String, Array[String])] = Array((6,Array(how, sweet, is, it, that, we, have)), (2,Array("")), (2,Array(will, this, question, cause, an, error)), (2,Array("")), (4,Array(how, do, we, create, a, new, tag, in)), (7,Array("")), (2,Array(test, after, clr, on)), (2,Array("")), (2,Array(testing, a, long, tag)), (2,Array("")))

I need filterWord to contain only the words that are not in wordList, but this doesn't seem to work: it does not filter out any of the words in wordList, and if I change the comparison to == it filters out everything.

Recommended answer

This removes any post that contains any of the words in wordList. It may or may not be what you want; please clarify your question.

Spark setup:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc = new SparkContext(conf)

Test data:

val jabberwocky = """
Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe:
All mimsy were the borogoves,
      And the mome raths outgrabe.

"Beware the Jabberwock, my son!
      The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
      The frumious Bandersnatch!"

He took his vorpal sword in hand;
      Long time the manxome foe he sought—
So rested he by the Tumtum tree
      And stood awhile in thought.

And, as in uffish thought he stood,
      The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
      And burbled as it came!

One, two! One, two! And through and through
      The vorpal blade went snicker-snack!
He left it dead, and with its head
      He went galumphing back.

"And hast thou slain the Jabberwock?
      Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!"
      He chortled in his joy.

’Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe:
All mimsy were the borogoves,
      And the mome raths outgrabe
"""
val words = "the and in all were"

Convert the test data to RDDs.

val posts = sc.parallelize(jabberwocky.split('\n')
                                      .filter(_.nonEmpty)
                                      .zipWithIndex
                                      .map (_.swap))

val wordList = sc.parallelize(words.split(' ')).map(x => (x.toLowerCase(), x))

Make a PairRDD with one row for each word in each post. The key is the word, and the value is the original post.

val postsPairs = posts.flatMap {
  case (i, s) => s.split("\\W+").map(w => (w.toLowerCase(), (i, s)))
}
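To make the shape of these rows concrete, here is a small sketch that applies the same expansion to a single post using plain Scala collections rather than an RDD. The names post, pairs and keys are only for illustration and are not part of the original answer.

```scala
// One (index, line) pair, taken from the test data above.
val post = (1, "      Did gyre and gimble in the wabe:")

// Same expansion as postsPairs, applied to a single post without Spark.
val pairs = post match {
  case (i, s) => s.split("\\W+").map(w => (w.toLowerCase, (i, s)))
}

// Keys only. The leading whitespace produces an empty first key,
// while the trailing ":" produces no extra key.
val keys = pairs.map(_._1).toList
println(keys) // List(, did, gyre, and, gimble, in, the, wabe)
```

The empty key comes from splitting a line that starts with non-word characters. It never matches an excluded word, so it should be harmless for the join in the next step.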

Find all the posts that DO contain one of the excluded words.

  val withExcluded = postsPairs.join(wordList).map(_._2._1)

(You could call .distinct here, but there's no point: the duplicates won't matter for the next step.)

Remove from the original list all the posts that contain one of the excluded words. Any posts that remain contain none of the excluded words.

  val res = posts.subtract(withExcluded)

  // (19,      He went galumphing back.)
  // (22,O frabjous day! Callooh! Callay!")
  // (21,      Come to my arms, my beamish boy!)
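If the goal is instead what the question describes, keeping every post but removing the listed words from it, one possible approach is to test each word for membership in a local Set. The filter in the question compares each String with the whole wordList value, and the two are never equal, which would explain why != kept everything and == removed everything. The sketch below shows the per-post logic in plain Scala; stripExcluded and excluded are hypothetical names, and it assumes the word list is small enough to hold in memory.

```scala
// Hypothetical helper: keep only the words of `text` that are not in `excluded`.
// The comparison is case-insensitive, matching the lower-casing used in the answer.
def stripExcluded(text: String, excluded: Set[String]): Array[String] =
  text.split("\\W+").filter(w => w.nonEmpty && !excluded.contains(w.toLowerCase))

// The excluded words from the answer's test data, as a local Set.
val excluded = "the and in all were".split(' ').toSet

val kept = stripExcluded("All mimsy were the borogoves,", excluded).toList
println(kept) // List(mimsy, borogoves)

// Why the original filter kept every word: a String never equals a collection.
val word: Any = "the"
println(word != excluded) // true
```

With the RDDs from the answer, this could presumably be applied as posts.map { case (i, s) => (i, stripExcluded(s, excluded)) }, with excluded built from wordList.map(_._1).collect().toSet. For a word list too large to collect to the driver, the join-based approach above is likely the better fit.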
