Spark MLlib Word2Vec Error: The vocabulary size should be > 0

Problem description

I am trying to implement word vectorization using Spark's MLlib, following the example given here.

I have a bunch of sentences that I want to use as input to train the model, but I am not sure whether the model takes sentences or just all the words as a single sequence of strings.

My input is as below:

scala> v.take(5)
res31: Array[Seq[String]] = Array(List([WrappedArray(0_42)]), List([WrappedArray(big, baller, shoe, ?)]), List([WrappedArray(since, eliud, win, ,, quick, fact, from, runner, from, country, kalenjins, !, write, ., happy, quick, fact, kalenjins, location, :, kenya, (, kenya's, western, highland, rift, valley, ), population, :, 4, ., 9, million, ;, compose, 11, subtribes, language, :, kalenjin, ;, swahili, ;, english, church, :, christianity, ~, africa, inland, church, [, aic, ],, church, province, kenya, [, cpk, ],, roman, catholic, church, ;, islam, translation, :, kalenjin, translate, ", tell, ", formation, :, wwii, ,, gikuyu, tribal, member, wish, separate, create, identity, ., later, ,, student, attend, alliance, high, school, (, first, british, public, school, kenya, ), form, ...

But when I try to train my Word2Vec model on this input, it fails:

scala> val word2vec = new Word2Vec()
word2vec: org.apache.spark.mllib.feature.Word2Vec = org.apache.spark.mllib.feature.Word2Vec@51567040

scala> val model = word2vec.fit(v)
java.lang.IllegalArgumentException: requirement failed: The vocabulary size should be > 0. You may need to check the setting of minCount, which could be large enough to remove all your words in sentences.

Does Word2Vec not take sentences as input?

Solution

Your input is correct. However, Word2Vec automatically drops every word that occurs fewer than a minimum number of times across the whole corpus (all sentences combined). This threshold, minCount, defaults to 5. In your case, it is highly likely that no word occurs 5 or more times in the data you use, so the vocabulary ends up empty.

To change the minimum required number of occurrences, use setMinCount(). For example, for a minimum count of 2:

val word2vec = new Word2Vec().setMinCount(2)
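
To see why the vocabulary can end up empty, here is a minimal plain-Scala sketch of the kind of frequency filtering described above. It is not Spark's actual implementation: buildVocab and the tiny corpus are hypothetical and made up for illustration, and the sketch only assumes that a word is kept when its count across all sentences is at least minCount.

```scala
// Plain-Scala illustration of minCount-style vocabulary filtering.
// This is NOT Spark's implementation; buildVocab is a hypothetical helper.
object MinCountSketch {

  /** Counts each token across all sentences and keeps those seen at least `minCount` times. */
  def buildVocab(sentences: Seq[Seq[String]], minCount: Int): Map[String, Int] =
    sentences.flatten
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
      .filter { case (_, count) => count >= minCount }

  // Tiny made-up corpus: "quick" and "fact" appear twice, every other token once.
  val sentences: Seq[Seq[String]] = Seq(
    Seq("big", "baller", "shoe"),
    Seq("quick", "fact", "from", "runner"),
    Seq("happy", "quick", "fact", "kalenjins")
  )

  def main(args: Array[String]): Unit = {
    // With a threshold of 5 no word survives, which is the situation the error message describes.
    println(buildVocab(sentences, minCount = 5).size)
    // Lowering the threshold to 2 keeps only the repeated words.
    println(buildVocab(sentences, minCount = 2).keySet)
  }
}
```

On this toy corpus the threshold of 5 leaves an empty vocabulary, while a threshold of 2 keeps only "quick" and "fact". Running the same kind of count over your own data before calling fit() is a quick way to choose a sensible minCount.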
