word2vec cosine similarity greater than 1 (Arabic text)


Problem description


I have trained my word2vec model from gensim and I am getting the nearest neighbors for some words in the corpus. Here are the similarity scores:

 top neighbors for الاحتلال:
الاحتلال: 1.0000001192092896
الاختلال: 0.9541053175926208
الاهتلال: 0.872565507888794
الاحثلال: 0.8386293649673462
الاكتلال: 0.8209128379821777

It is odd to get a similarity greater than 1. I cannot apply any stemming to my text because the text includes many OCR spelling mistakes (I got the text from OCR-ed documents). How can I fix the issue?

Note: I am using model.similarity(t1, t2)

This is how I trained my Word2Vec Model:

    import os
    import time

    import gensim

    documents = list()
    tokenize = lambda x: gensim.utils.simple_preprocess(x)
    t1 = time.time()
    docs = read_files(TEXT_DIRS, nb_docs=5000)
    t2 = time.time()
    print('Reading docs took: {:.3f} mins'.format((t2 - t1) / 60))
    print('Number of documents: %i' % len(docs))

    # Training the model
    model = gensim.models.Word2Vec(docs, size=EMBEDDING_SIZE, min_count=5)
    if not os.path.exists(MODEL_DIR):
        os.makedirs(MODEL_DIR)
    model.save(os.path.join(MODEL_DIR, 'word2vec'))

    weights = model.wv.vectors
    index_words = model.wv.index2word

    vocab_size = weights.shape[0]
    embedding_dim = weights.shape[1]

    print('Shape of weights:', weights.shape)
    print('Vocabulary size: %i' % vocab_size)
    print('Embedding size: %i' % embedding_dim)

Below is the read_files function I defined:

def read_files(text_directories, nb_docs):
    """
    Read in text files
    """
    documents = list()
    tokenize = lambda x: gensim.utils.simple_preprocess(x)
    print('started reading ...')
    for path in text_directories:
        count = 0
        # Read in all files in directory
        if os.path.isdir(path):
            all_files = os.listdir(path)
            for filename in all_files:
                if filename.endswith('.txt') and filename[0].isdigit():
                    count += 1
                    with open('%s/%s' % (path, filename), encoding='utf-8') as f:
                        doc = f.read()
                        doc = clean_text_arabic_style(doc)
                        doc = clean_doc(doc)
                        documents.append(tokenize(doc))
                        if count % 100 == 0:
                            print('processed {} files so far from {}'.format(count, path))
                if count >= nb_docs and count <= nb_docs + 200:
                    print('REACHED END')
                    break
        if count >= nb_docs and count <= nb_docs:
            print('REACHED END')
            break

    return documents

I tried this thread, but it doesn't help me, because my text is Arabic and full of misspellings.

Update: I tried the following (getting the similarity of the exact same word with itself):

print(model.similarity('الاحتلال','الاحتلال'))

and it gave me the following result:

1.0000001

Solution

Definitionally, the cosine-similarity measure should max at 1.0.
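
(Recall that cosine similarity is (a · b) / (|a| |b|); for the unit-normalized vectors gensim compares, that reduces to a plain dot product, and in exact arithmetic it always lies between -1.0 and 1.0.)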

But in practice, floating-point number representations in computers have tiny imprecisions in the deep-decimals. And, especially when a number of calculations happen in a row (as with the calculation of this cosine-distance), those will sometimes lead to slight deviations from what the expected maximum or exactly-right answer "should" be.

(Similarly: sometimes calculations that, mathematically, should result in the exact same answer no matter how they are reordered/regrouped deviate slightly when done in different orders.)

But, as these representational errors are typically "very small", they're usually not of practical concern. (They are especially small in the range of numbers around -1.0 to 1.0, but can become quite large when dealing with giant numbers.)

In your original case, the deviation is just 0.000000119209289. In the word-to-itself case, the deviation is just 0.0000001. That is, about one-ten-millionth off. (Your other sub-1.0 values have similar tiny deviations from perfect calculation, but they aren't noticeable.)
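
(A concrete way to see this, as a standalone numpy sketch rather than gensim's actual code path: unit-normalize a random float32 vector – word2vec similarity is likewise a dot product of unit-normalized float32 vectors – and dot it with itself. Exact math says 1.0, but float32 rounding can land a hair above or below.)

import numpy as np

# Standalone illustration, not gensim's code: a random float32 vector,
# roughly analogous to a word2vec word vector.
rng = np.random.default_rng(0)
v = rng.standard_normal(100).astype(np.float32)

# Unit-normalize, then dot the vector with itself.
# Mathematically this is exactly 1.0; in float32 it may print
# something like 0.99999994 or 1.0000001 instead.
unit = v / np.linalg.norm(v)
print(np.dot(unit, unit))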

In most cases, you should just ignore it.

If you find it distracting to you or your users in numerical displays/logging, simply choosing to display all such values to a limited number of after-the-decimal-point digits – say 4 or even 5 or 6 – will hide those noisy digits. For example, using a Python 3 format-string:

sim = model.similarity('الاحتلال','الاحتلال')
print(f"{sim:.6}")

(Libraries like numpy that work with large arrays of such floats can even set a global default for display precision – see numpy.set_printoptions – though that shouldn't affect the raw Python floats you're examining.)
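
(A quick sketch of that numpy setting, applied to a hypothetical array holding the scores from above:)

import numpy as np

np.set_printoptions(precision=6)   # global default for printing numpy arrays
scores = np.array([1.0000001192092896, 0.9541053175926208])
print(scores)                      # e.g. [1.       0.954105]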

If for some reason you absolutely need the values to be capped at 1.0, you could add extra code to do that. But it's usually a better idea to write your tests & printouts so that they're robust to, and untroubled by, such tiny deviations from perfect math.
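
(A minimal sketch of both options, reusing the score from the neighbor listing above: min() does the hard cap, while math.isclose is one way to write a deviation-tolerant check instead.)

import math

sim = 1.0000001192092896                      # e.g. a model.similarity(...) result

capped = min(sim, 1.0)                        # hard-cap at 1.0 if you really must
print(capped)                                 # 1.0

# Usually better: test with a tolerance instead of expecting exact math
print(math.isclose(sim, 1.0, rel_tol=1e-6))   # True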
