Replace bigrams based on their frequency in Scala and Spark


Problem description

I want to replace every bigram whose frequency count is greater than a threshold with this pattern (word1.concat("-").concat(word2)). This is what I have tried:

import org.apache.spark.{SparkConf, SparkContext}

object replace {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("replace")

    val sc = new SparkContext(conf)
    val rdd = sc.textFile("data/ddd.txt")

    val threshold = 2

    val searchBigram=rdd.map {
      _.split('.').map { substrings =>
        // Trim substrings and then tokenize on spaces
        substrings.trim.split(' ').

          // Remove non-alphanumeric characters and convert to lowercase
          map {
          _.replaceAll( """\W""", "").toLowerCase()
        }.
          sliding(2)

      }.flatMap {
        identity
      }
        .map {
        _.mkString(" ")
      }
        .groupBy {
        identity
      }
        .mapValues {
        _.size
      }
    }.flatMap {
      identity
    }.reduceByKey(_ + _).collect
      .sortBy(-_._2)
      .takeWhile(_._2 >= threshold)
      .map(x=>x._1.split(' '))
      .map(x=>(x(0), x(1))).toVector


    val sample1 = sc.textFile("data/ddd.txt")
    val sample2 = sample1.map(s=> s.split(" ") // split on space
      .sliding(2)                       // take continuous pairs
      .map{ case Array(a, b) => (a, b) }
      .map(elem => if (searchBigram.contains(elem)) (elem._1.concat("-").concat(elem._2)," ") else elem)
      .map{case (e1,e2) => e1}.mkString(" "))
    sample2.foreach(println)
  }
}

But this code removes the last word of every document, and it shows some errors when I run it on a file that contains many documents.

Suppose my input file contains these documents:

surprise heard thump opened door small seedy man clasping package wrapped.

upgrading system found review spring two thousand issue moody audio mortgage backed.

omg left gotta wrap review order asap . understand issue moody hand delivered dali lama

speak hands wear earplugs lives . listen maintain link long .

buffered lightning two thousand volts cables burned revivification place .

cables volts cables finally able hear auditory issue moody gem long rumored music .

and my desired output is:

surprise heard thump opened door small-man clasping package wrapped.

upgrading system found review spring two-thousand issue-moody audio mortgage backed.

omg left gotta wrap review order asap . understand issue-moody hand delivered dali lama

speak hands wear earplugs lives . listen maintain link long small-man .

buffered lightning two-thousand volts-cables burned revivification place .

cables volts-cables finally able hear auditory issue-moody gem long rumored music .

Can anybody help me?

Solution
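
A note on the symptom first. In the question's second pass, sliding(2) turns n words into n-1 pairs and only the first element of each pair is kept, so the final word of a line never reaches the output, and the second word of a merged pair is emitted again by the next window. The Python sketch below mirrors that logic on a tiny token list and contrasts it with a single left-to-right pass that skips the second word after a merge. The names naive_replace and merge_bigrams are illustrative and do not come from the original post, and the set of frequent bigrams is hard-coded as an assumption.

```python
def naive_replace(tokens, frequent):
    # Mirrors the question's logic: slide a window of 2 over the tokens,
    # merge frequent pairs, then keep only the first element of each pair.
    pairs = list(zip(tokens, tokens[1:]))
    merged = [(a + "-" + b, " ") if (a, b) in frequent else (a, b)
              for a, b in pairs]
    return " ".join(first for first, _ in merged)


def merge_bigrams(tokens, frequent):
    # Walk the tokens once; when a frequent pair is found, emit it joined
    # and skip its second word so nothing is dropped or repeated.
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
            out.append(tokens[i] + "-" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


tokens = "review spring two thousand issue".split(" ")
frequent = {("two", "thousand")}  # assumed result of the counting step

print(naive_replace(tokens, frequent))  # review spring two-thousand thousand
print(merge_bigrams(tokens, frequent))  # review spring two-thousand issue
```

As for the errors on larger files, a likely cause is the pattern `case Array(a, b)`: when a line has fewer than two tokens (for example the blank lines between documents), sliding(2) yields a single shorter window, which that pattern does not match.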

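from operator import add

# LocalSparkContext is a helper module from the answerer's own project
# (assumed importable); it is not part of PySpark.
import LocalSparkContext


def getNgrams(sentence):
    # Return the adjacent word pairs (bigrams) of one line of text.
    sen = sentence.split(" ")
    return [(sen[k], sen[k + 1]) for k in range(len(sen) - 1)]


if __name__ == '__main__':
    lsc = LocalSparkContext.LocalSparkContext("Recommendation", "spark://BigData:7077")
    sc = lsc.getBaseContext()
    ssc = lsc.getSQLContext()
    inFile = "bigramstxt.txt"
    try:
        sen = sc.textFile(inFile, 1)
        v = 1
        brv = sc.broadcast(v)
        # Count every bigram and keep those that occur more than v times.
        wordgroups = (sen.flatMap(getNgrams)
                         .map(lambda t: (t, 1))
                         .reduceByKey(add)
                         .filter(lambda t: t[1] > brv.value))
        bigrams = wordgroups.collect()
    finally:
        # Stop the context whether or not the job succeeded.
        sc.stop()

    # Do the replacement on the driver, over the raw file contents.
    with open(inFile, 'r') as f:
        inp = f.read()
    print(inp)
    for b in bigrams:
        print(b)
        inp = inp.replace(" ".join(b[0]), "-".join(b[0]))

    print(inp)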
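
To see what the counting step produces without a cluster, here is a minimal plain-Python sketch of the same idea, using collections.Counter in place of reduceByKey. It runs on three of the sample lines from the question and uses the question's rule (count >= 2), which is equivalent to the answer's filter (count > 1). The helper names get_bigrams and frequent_bigrams are illustrative assumptions, not part of the original answer.

```python
from collections import Counter

# Three of the sample documents from the question.
docs = [
    "upgrading system found review spring two thousand issue moody audio mortgage backed.",
    "buffered lightning two thousand volts cables burned revivification place .",
    "cables volts cables finally able hear auditory issue moody gem long rumored music .",
]


def get_bigrams(line):
    # Adjacent word pairs of one line.
    words = line.split()
    return list(zip(words, words[1:]))


def frequent_bigrams(lines, threshold):
    # Count bigrams across all lines and keep those seen at least `threshold` times.
    counts = Counter(bg for line in lines for bg in get_bigrams(line))
    return {bg for bg, n in counts.items() if n >= threshold}


frequent = frequent_bigrams(docs, threshold=2)
print(sorted(frequent))
# [('issue', 'moody'), ('two', 'thousand'), ('volts', 'cables')]

# Same substring replacement as the answer, applied line by line.
replaced = []
for line in docs:
    for bg in frequent:
        line = line.replace(" ".join(bg), "-".join(bg))
    replaced.append(line)
    print(line)
```

One caveat with this style of replacement: str.replace works on raw substrings rather than whole words, so "two thousand" would also match inside "twenty-two thousands". A token-level pass avoids that if it matters for your data.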
