在Spark中加载Word2Vec模型 [英] Load Word2Vec model in Spark

查看:643
本文介绍了在Spark中加载Word2Vec模型的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

是否可以加载预训练的(二进制)模型来激发(使用scala)?我试图加载由Google生成的二进制模型之一,如下所示:

Is it possible to load a pretrained (binary) model to spark (using scala) ? I have tried to load one of the binary models which was generated by google like this:

    import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}


    val model = Word2VecModel.load(sc, "GoogleNews-vectors-negative300.bin")

,但无法找到元数据目录.我还创建了文件夹,并将二进制文件附加到该文件夹​​中,但是无法解析.我没有找到这个问题的包装.

but it is not able to locate the metadata directory. I also created the folder and appended the binary file there but it cannot be parsed. I did not find any wrapper for this issue.

推荐答案

我编写了一个快速函数,用于将Google新闻预训练模型加载到spark word2vec模型中.享受.

I wrote a quick function to load in the google news pretrained model into a spark word2vec model. Enjoy.

def loadBin(file: String) = {
  def readUntil(inputStream: DataInputStream, term: Char, maxLength: Int = 1024 * 8): String = {
    var char: Char = inputStream.readByte().toChar
    val str = new StringBuilder
    while (!char.equals(term)) {
      str.append(char)
      assert(str.size < maxLength)
      char = inputStream.readByte().toChar
    }
    str.toString
  }
  val inputStream: DataInputStream = new DataInputStream(new GZIPInputStream(new FileInputStream(file)))
  try {
    val header = readUntil(inputStream, '\n')
    val (records, dimensions) = header.split(" ") match {
      case Array(records, dimensions) => (records.toInt, dimensions.toInt)
    }
    new Word2VecModel((0 until records).toArray.map(recordIndex => {
      readUntil(inputStream, ' ') -> (0 until dimensions).map(dimensionIndex => {
        java.lang.Float.intBitsToFloat(java.lang.Integer.reverseBytes(inputStream.readInt()))
      }).toArray
    }).toMap)
  } finally {
    inputStream.close()
  }
}

这篇关于在Spark中加载Word2Vec模型的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆