Co-occurrence matrix for the top 2000 words of a TF-IDF vectorizer


Problem description

I fitted a TF-IDF vectorizer on my text data with max_features = 2000 and got a matrix of shape (100000, 2000).

I then try to compute the co-occurrence matrix with the code below.

length = 2000
m = np.zeros([length,length]) # n is the count of all words
def cal_occ(sentence,m):
    for i,word in enumerate(sentence):
        print(i)
        print(word)
        for j in range(max(i-window,0),min(i+window,length)):
            print(j)
            print(sentence[j])
            m[word,sentence[j]]+=1
for sentence in tf_vec:
    cal_occ(sentence, m)

I get the following error.

0
(0, 1210)   0.20426932204609685
(0, 191)    0.23516811545499153
(0, 592)    0.2537746177804585
(0, 1927)   0.2896119458034052
(0, 1200)   0.1624114163299802
(0, 1856)   0.24376566018277918
(0, 1325)   0.2789314085220367
(0, 756)    0.15365704375851477
(0, 1130)   0.293489555928974
(0, 346)    0.21231046306681553
(0, 557)    0.2036759579760878
(0, 1036)   0.29666992324872365
(0, 264)    0.36435609585838674
(0, 1701)   0.242619998334931
(0, 1939)   0.33934107208095693
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-96-ad505b6df734> in <module>()
 11             m[word,sentence[j]]+=1
 12 for sentence in tf_vec:
 ---> 13     cal_occ(sentence, m)

 <ipython-input-96-ad505b6df734> in cal_occ(sentence, m)
  9             print(j)
 10             print(sentence[j])
 ---> 11             m[word,sentence[j]]+=1
 12 for sentence in tf_vec:
 13     cal_occ(sentence, m)

IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

Recommended answer

The problem is most probably here:

for j in range(max(i-window,0),min(i+window,length)):

The min function returns length when i + window exceeds the bound. Can you try this instead of the line above?

for j in range(max(i-window,0),min(i+window,length-1)):
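
To make the difference between the two loop bounds concrete, here is a small sketch that lists the indices each variant visits. The helper name `window_indices`, the `stop_offset` parameter, and the sample values for `length` and `window` are assumptions for illustration; they do not come from the question. Note that Python's `range` excludes its stop value, so the original bound already stops at `length - 1`, and the `length - 1` variant additionally skips the last position.

```python
def window_indices(i, window, length, stop_offset=0):
    """Return the indices visited by
    range(max(i - window, 0), min(i + window, length - stop_offset))."""
    start = max(i - window, 0)
    stop = min(i + window, length - stop_offset)
    return list(range(start, stop))


length = 10  # hypothetical small stand-in for 2000
window = 3   # hypothetical window size (not given in the question)

# Last position in the sequence, where i + window exceeds length.
original = window_indices(9, window, length)                  # stop is min(12, 10) = 10
suggested = window_indices(9, window, length, stop_offset=1)  # stop is min(12, 9) = 9

print(original)
print(suggested)
```

If the error persists after changing the bound, it may be worth checking the type of the values used as indices, since the message complains about the kind of index rather than its range.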

Hope this helps,

Cheers
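
A further note on the error itself: the printed output in the question shows that `word` is a sparse row of the form `(0, 1210)   0.204...`, not an integer, and NumPy only accepts integers, slices, and integer or boolean arrays as indices. Iterating over a SciPy sparse matrix yields sparse rows, so `m[word, sentence[j]]` receives sparse matrices as indices. Since TF-IDF vectors do not keep word order, a sliding window over them may not be meaningful. The following is a minimal sketch of one possible alternative, assuming that document-level co-occurrence (two terms appearing in the same document) is what is wanted. The small `tf_vec` matrix is a hypothetical stand-in for the real (100000, 2000) matrix.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for the TF-IDF matrix: 3 documents, 4 terms.
tf_vec = sparse.csr_matrix(np.array([
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.7, 0.7, 0.0],
    [0.6, 0.6, 0.0, 0.0],
]))

# Iterating a CSR matrix yields one sparse row per document,
# not integer word indices.
first_row = next(iter(tf_vec))
print(type(first_row).__name__, first_row.shape)

# Mark which terms are present in each document (1 if the weight is > 0).
presence = (tf_vec > 0).astype(np.int64)

# Entry (a, b) counts the documents that contain both term a and term b.
cooc = (presence.T @ presence).toarray()
np.fill_diagonal(cooc, 0)  # ignore a term co-occurring with itself

print(cooc)
```

For the full matrix the result is 2000 x 2000, so converting it to a dense array should fit in memory; the product itself stays sparse until `toarray()` is called.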
