TfIdfVectorizer: How does the vectorizer with fixed vocab deal with new words?

Problem description

I'm working on a corpus of ~100k research papers. I'm considering three fields:

  1. plaintext
  2. title
  3. abstract

I used the TfIdfVectorizer to get a TfIdf representation of the plaintext field and fed the resulting vocabulary back into the vectorizers for title and abstract, to make sure all three representations work on the same vocabulary. My idea was that since the plaintext field is much bigger than the other two, its vocabulary will most probably cover all the words in the other fields. But how would the TfIdfVectorizer deal with new words/tokens if that weren't the case?

Here's a sample of my code:

vectorizer = TfidfVectorizer(min_df=2)
plaintexts_tfidf = vectorizer.fit_transform(plaintexts)
vocab = vectorizer.vocabulary_
# later in an another script after loading the vocab from disk
vectorizer = TfidfVectorizer(min_df=2, vocabulary=vocab)
titles_tfidf = vectorizer.fit_transform(titles)

The vocab has ~900k words.

During vectorization I didn't run into any problems, but later, when I wanted to compare the similarity between the vectorized titles using sklearn.metrics.pairwise.cosine_similarity, I ran into this error:

>> titles_sim = cosine_similarity(titles_tfidf)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-237-5aa86fe892da> in <module>()
----> 1 titles_sim = cosine_similarity(titles)

/usr/local/lib/python3.5/dist-packages/sklearn/metrics/pairwise.py in cosine_similarity(X, Y, dense_output)
    916         Y_normalized = normalize(Y, copy=True)
    917 
--> 918     K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
    919 
    920     return K

/usr/local/lib/python3.5/dist-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
    184         ret = a * b
    185         if dense_output and hasattr(ret, "toarray"):
--> 186             ret = ret.toarray()
    187         return ret
    188     else:

/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py in toarray(self, order, out)
    918     def toarray(self, order=None, out=None):
    919         """See the docstring for `spmatrix.toarray`."""
--> 920         return self.tocoo(copy=False).toarray(order=order, out=out)
    921 
    922     ##############################################################

/usr/local/lib/python3.5/dist-packages/scipy/sparse/coo.py in toarray(self, order, out)
    256         M,N = self.shape
    257         coo_todense(M, N, self.nnz, self.row, self.col, self.data,
--> 258                     B.ravel('A'), fortran)
    259         return B
    260 

ValueError: could not convert integer scalar

I'm not really sure whether it's related, but I can't see what's going wrong here, especially since I don't run into the error when calculating the similarities on the plaintext vectors.

Am I missing something? Is there a better way to use the vectorizer?

The shapes of the sparse csr_matrices are equal.

>> titles_tfidf.shape
(96582, 852885)
>> plaintexts_tfidf.shape
(96582, 852885)

Answer

I'm afraid the matrix might simply be too large. It would have 96582 * 96582 = 9,328,082,724 cells, which is roughly 75 GB as a dense float64 array. Try slicing titles_tfidf a bit and check.

Source: http://scipy-user.10969.n7.nabble.com/SciPy-User-strange-error-when-creating-csr-matrix-td20129.html
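
If you only need the similarities a block at a time, you can avoid allocating all ~9.3 billion cells at once. A minimal sketch (chunked_cosine_similarity is a made-up helper, not part of scikit-learn; titles_tfidf is the matrix from the question):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def chunked_cosine_similarity(X, chunk_size=1000):
    # yields (start_row, dense block of shape (<=chunk_size, n_docs));
    # with chunk_size=1000 each block here is ~770 MB instead of ~75 GB
    for start in range(0, X.shape[0], chunk_size):
        yield start, cosine_similarity(X[start:start + chunk_size], X)

# example: the most similar *other* title for every document
for start, block in chunked_cosine_similarity(titles_tfidf):
    rows = np.arange(block.shape[0])
    block[rows, start + rows] = -1.0  # mask self-similarity on the diagonal
    best = block.argmax(axis=1)       # index of the closest title per row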

EDIT: If you are using an older SciPy/NumPy version, you might want to update: https://github.com/scipy/scipy/pull/4678

EDIT 2: Also, if you are using 32-bit Python, switching to 64-bit might help (I suppose).
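
To quickly check both of those points, you can inspect the interpreter and the installed SciPy version (a trivial sketch):

import struct
import scipy

print(struct.calcsize("P") * 8)  # 64 -> 64-bit interpreter, 32 -> 32-bit
print(scipy.__version__)         # compare against the release carrying the fix linked above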

EDIT 3: Answering your original question: when you use the vocabulary from plaintexts and there are new words in titles, they will simply be ignored, without influencing the tf-idf values of the words that are in the vocabulary. Hopefully this snippet makes it more understandable:

from sklearn.feature_extraction.text import TfidfVectorizer

plaintexts = ["They are", "plain texts texts amoersand here"]
titles = ["And here", "titles ", "wolf dog eagle", "But here plain"]

# fit on plaintexts and reuse the learned vocabulary for titles
vectorizer = TfidfVectorizer()
plaintexts_tfidf = vectorizer.fit_transform(plaintexts)
vocab = vectorizer.vocabulary_
vectorizer = TfidfVectorizer(vocabulary=vocab)
titles_tfidf = vectorizer.fit_transform(titles)
print('values using vocabulary')
print(titles_tfidf)
print(vectorizer.get_feature_names())  # get_feature_names_out() in newer scikit-learn

# for comparison: a fresh vectorizer fitted directly on titles
print('Brand new vectorizer')
vectorizer = TfidfVectorizer()
titles_tfidf = vectorizer.fit_transform(titles)
print(titles_tfidf)
print(vectorizer.get_feature_names())

The result is:

values using vocabulary
  (0, 2)        1.0
  (3, 3)        0.78528827571
  (3, 2)        0.61913029649
['amoersand', 'are', 'here', 'plain', 'texts', 'they']
Brand new vectorizer
  (0, 0)        0.78528827571
  (0, 4)        0.61913029649
  (1, 6)        1.0
  (2, 7)        0.57735026919
  (2, 2)        0.57735026919
  (2, 3)        0.57735026919
  (3, 4)        0.486934264074
  (3, 1)        0.617614370976
  (3, 5)        0.617614370976
['and', 'but', 'dog', 'eagle', 'here', 'plain', 'titles', 'wolf']

Notice that this is not the same as removing the words that don't occur in plaintexts from titles and then fitting a fresh vectorizer.
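
If you want to see the "ignored" behaviour directly, here is a small check (a sketch reusing vocab from the snippet above; "zebra" stands in for any out-of-vocabulary token):

from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer(vocabulary=vocab)
a = v.fit_transform(["plain texts", "plain texts here"])
b = v.fit_transform(["plain texts zebra", "plain texts here"])
print((a != b).nnz == 0)  # True: "zebra" is dropped and changes nothing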
