Spark MLib Word2Vec错误:词汇量应为> 0 [英] Spark MLib Word2Vec Error: The vocabulary size should be > 0

查看:337
本文介绍了Spark MLib Word2Vec错误:词汇量应为> 0的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用Spark的MLLib实现单词向量化.我在此处中给出的示例.. >

我想提供一堆句子,作为训练模型的输入.但是不确定此模型是采用句子还是仅将所有单词作为字符串序列.

我的输入如下:

scala> v.take(5)
res31: Array[Seq[String]] = Array(List([WrappedArray(0_42)]), List([WrappedArray(big, baller, shoe, ?)]), List([WrappedArray(since, eliud, win, ,, quick, fact, from, runner, from, country, kalenjins, !, write, ., happy, quick, fact, kalenjins, location, :, kenya, (, kenya's, western, highland, rift, valley, ), population, :, 4, ., 9, million, ;, compose, 11, subtribes, language, :, kalenjin, ;, swahili, ;, english, church, :, christianity, ~, africa, inland, church, [, aic, ],, church, province, kenya, [, cpk, ],, roman, catholic, church, ;, islam, translation, :, kalenjin, translate, ", tell, ", formation, :, wwii, ,, gikuyu, tribal, member, wish, separate, create, identity, ., later, ,, student, attend, alliance, high, school, (, first, british, public, school, kenya, ), form, ...

但是当我尝试在此输入上训练我的word2vec模型时,它将无法正常工作.

scala> val word2vec = new Word2Vec()
word2vec: org.apache.spark.mllib.feature.Word2Vec = org.apache.spark.mllib.feature.Word2Vec@51567040

scala> val model = word2vec.fit(v)
java.lang.IllegalArgumentException: requirement failed: The vocabulary size should be > 0. You may need to check the setting of minCount, which could be large enough to remove all your words in sentences.

Word2Vec是否不接受句子作为输入?

解决方案

您的输入正确.但是,Word2Vec将自动删除词汇表中未出现最少次数的单词(所有句子组合在一起).默认情况下,此值为5.在您的情况下,极有可能在您使用的数据中没有单词出现5次或更多次.

要更改所需的最小单词出现次数,请使用setMinCount(),例如最小计数为2:

val word2vec = new Word2Vec().setMinCount(2)

I am trying to implement word vectorization using Spark's MLLib. I am following the example given here.

I have bunch of sentences which I want to give as input to train the model. But am not sure if this model takes sentences or just takes all the words as a sequence of string.

My input is as below:

scala> v.take(5)
res31: Array[Seq[String]] = Array(List([WrappedArray(0_42)]), List([WrappedArray(big, baller, shoe, ?)]), List([WrappedArray(since, eliud, win, ,, quick, fact, from, runner, from, country, kalenjins, !, write, ., happy, quick, fact, kalenjins, location, :, kenya, (, kenya's, western, highland, rift, valley, ), population, :, 4, ., 9, million, ;, compose, 11, subtribes, language, :, kalenjin, ;, swahili, ;, english, church, :, christianity, ~, africa, inland, church, [, aic, ],, church, province, kenya, [, cpk, ],, roman, catholic, church, ;, islam, translation, :, kalenjin, translate, ", tell, ", formation, :, wwii, ,, gikuyu, tribal, member, wish, separate, create, identity, ., later, ,, student, attend, alliance, high, school, (, first, british, public, school, kenya, ), form, ...

But when I try to train my word2vec model on this input it does not work.

scala> val word2vec = new Word2Vec()
word2vec: org.apache.spark.mllib.feature.Word2Vec = org.apache.spark.mllib.feature.Word2Vec@51567040

scala> val model = word2vec.fit(v)
java.lang.IllegalArgumentException: requirement failed: The vocabulary size should be > 0. You may need to check the setting of minCount, which could be large enough to remove all your words in sentences.

Does Word2Vec not take sentences as input?

解决方案

Your input is correct. However, Word2Vec will automatically remove words that do not occur a minimum number of times in the vocabulary (all sentences combined). By default this value is 5. In your case, it is highly likely that no word occurs 5 or more times in the data you use.

To change the minimum required word occurrences use setMinCount(), for example a min count of 2:

val word2vec = new Word2Vec().setMinCount(2)

这篇关于Spark MLib Word2Vec错误:词汇量应为> 0的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆