word2vec cosine similarity greater than 1 (Arabic text)


Problem description

I have trained a word2vec model with gensim and found the nearest neighbors of certain words in the corpus. Here are the similarity scores:

Top neighbors for الاحتلال:
الاحتلال: 1.0000001192092896
الاختلال: 0.9541053175926208
الاهتلال: 0.872565507888794
الاحثلال: 0.8386293649673462
الاكتلال: 0.8209128379821777

It is weird to get a similarity greater than 1. I cannot apply any stemming to my text because the text contains many OCR spelling mistakes (I got the text from OCR-ed documents). How can I fix the problem?

Note that I am using model.similarity(t1, t2).

Here is how I trained my Word2Vec model:

documents = list()
tokenize = lambda x: gensim.utils.simple_preprocess(x)
t1 = time.time()
docs = read_files(TEXT_DIRS, nb_docs=5000)
t2 = time.time()
print('Reading docs took: {:.3f} mins'.format((t2 - t1) / 60))
print('Number of documents: %i' % len(docs))

# Train the model
model = gensim.models.Word2Vec(docs, size=EMBEDDING_SIZE, min_count=5)

if not os.path.exists(MODEL_DIR):
    os.makedirs(MODEL_DIR)
model.save(os.path.join(MODEL_DIR, 'word2vec'))

weights = model.wv.vectors
index_words = model.wv.index2word
vocab_size = weights.shape[0]
embedding_dim = weights.shape[1]

print('Shape of weights:', weights.shape)
print('Vocabulary size: %i' % vocab_size)
print('Embedding size: %i' % embedding_dim)

Below is the read_files function I defined:

def read_files(text_directories, nb_docs):
    """
    Read in text files
    """
    documents = list()
    tokenize = lambda x: gensim.utils.simple_preprocess(x)
    print('started reading ...')
    for path in text_directories:
        count = 0
        # Read in all files in directory
        if os.path.isdir(path):
            all_files = os.listdir(path)
            for filename in all_files:
                if filename.endswith('.txt') and filename[0].isdigit():
                    count += 1
                    with open('%s/%s' % (path, filename), encoding='utf-8') as f:
                        doc = f.read()
                        doc = clean_text_arabic_style(doc)
                        doc = clean_doc(doc)
                        documents.append(tokenize(doc))
                    if count % 100 == 0:
                        print('processed {} files so far from {}'.format(count, path))
                if count >= nb_docs and count <= nb_docs + 200:
                    print('REACHED END')
                    break
        if count >= nb_docs and count <= nb_docs:
            print('REACHED END')
            break
    return documents

I tried this thread, but it won't help me because I rather have Arabic and misspelled text.

Update: I tried the following (getting the similarity between the exact same word):

print(model.similarity('الاحتلال','الاحتلال'))

and it gave me the following result:

1.0000001

Solution

Definitionally, the cosine-similarity measure should max at 1.0.

But in practice, floating-point number representations in computers have tiny imprecisions in the deep decimals. And, especially when a number of calculations happen in a row (as with the calculation of this cosine-distance), those will sometimes lead to slight deviations from what the expected maximum or exactly-right answer "should" be.
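A quick illustration of such accumulated error, in plain Python:

```python
# Ten additions of 0.1 "should" give exactly 1.0, but each step
# rounds to the nearest representable double, and the errors pile up:
total = 0.0
for _ in range(10):
    total += 0.1

print(total)         # 0.9999999999999999
print(total == 1.0)  # False
```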

(Similarly: calculations that, mathematically, should give exactly the same answer no matter how they are reordered/regrouped will sometimes deviate slightly when done in different orders.)
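For example, regrouping a mathematically identical sum changes the result:

```python
# Same three numbers, different grouping, different floats:
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

print(a)       # 0.6000000000000001
print(b)       # 0.6
print(a == b)  # False
```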

But, as these representational errors are typically "very small", they're usually of no practical concern. (They are especially small for numbers in the range around -1.0 to 1.0, but can become quite large when dealing with giant numbers.)
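The size of the representation error scales with magnitude; Python's `math.ulp` (3.9+) reports the gap between a float and the next representable one:

```python
import math

# Gap to the next representable double at different magnitudes:
print(math.ulp(1.0))   # ~2.2e-16: tiny near 1.0
print(math.ulp(1e16))  # 2.0: at this scale whole integers get skipped
```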

In your original case, the deviation is just 0.000000119209289. In the word-to-itself case, the deviation is just 0.0000001. That is, about one ten-millionth off. (Your other sub-1.0 values have similar tiny deviations from perfect calculation, but they aren't noticeable.)
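The exact figure is telling: gensim stores word vectors as 32-bit floats, and 1.0000001192092896 is precisely the smallest float32 greater than 1.0, i.e. the similarity overshot by a single unit in the last place. A quick numpy check:

```python
import numpy as np

# The smallest float32 strictly greater than 1.0:
next_up = np.nextafter(np.float32(1.0), np.float32(2.0))
print(float(next_up))            # 1.0000001192092896, the value in the question
print(np.finfo(np.float32).eps)  # 1.1920929e-07, the deviation seen above
```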

In most cases, you should just ignore it.

If you find it distracting to you or your users in numerical displays/logging, simply choosing to display all such values to a limited number of after-the-decimal-point digits (say 4, or even 5 or 6) will hide those noisy digits. For example, using a Python 3 format string:

sim = model.similarity('الاحتلال','الاحتلال')
print(f"{sim:.6}")

(Libraries like numpy that work with large arrays of such floats can even set a global default for display precision – see numpy.set_print_options – though that shouldn't affect the raw Python floats you're examining.)
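For instance, `numpy.set_printoptions` rounds how arrays are displayed (only the printout changes, not the stored values):

```python
import numpy as np

np.set_printoptions(precision=4)  # round array *display* to 4 decimals
sims = np.array([1.0000001192092896, 0.9541053175926208])
print(sims)  # the noisy trailing digits no longer show
```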

If for some reason you absolutely need the values to be capped at 1.0, you could add extra code to do that. But, it's usually a better idea to choose your tests & printouts to be robust to, & oblivious with regard to, such tiny deviations from perfect math.
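If you do want the cap, a one-line wrapper is enough (the `capped_similarity` name here is illustrative, not a gensim API):

```python
def capped_similarity(model, w1, w2):
    # Clamp any floating-point overshoot back down to the 1.0 ceiling
    return min(model.similarity(w1, w2), 1.0)
```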
