Problems using a custom vocabulary for TfidfVectorizer scikit-learn


Question

I'm trying to use a custom vocabulary in scikit-learn for some clustering tasks and I'm getting very weird results.

The program runs ok when not using a custom vocabulary and I'm satisfied with the cluster creation. However, I have already identified a group of words (around 24,000) that I would like to use as a custom vocabulary.

The words are stored in a SQL Server table. So far I have tried two approaches, but both give the same result in the end. The first creates a list, the second a dictionary. The code for building the dictionary is as follows:

import re

myvocab = {}      # dictionary approach: term -> column index
vocabulary = []   # list approach

count = 0

for row in results:
    # Strip HTML entities such as &amp; or &#39; from the skill name
    skillName = re.sub(r'&#?[a-z0-9]+;', ' ', row['SkillName'])
    skillName = unicode(skillName, "utf-8")
    vocabulary.append(skillName)   # using a list
    myvocab[skillName] = count     # using a dictionary
    count += 1

I then use the vocabulary (either the list version or the dictionary; both give the same result in the end) in the TfidfVectorizer as follows:

vectorizer = TfidfVectorizer(max_df=0.8, stop_words='english',
                             ngram_range=(1, 2), vocabulary=myvocab)
X = vectorizer.fit_transform(dataset2)
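As a quick check of the list-vs-dictionary claim above, the two forms can be compared on a toy corpus (a minimal sketch with made-up terms standing in for the SQL Server data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy vocabulary in both forms; the dict maps each term to its column index
vocab_list = ["python", "java", "machine learning"]
vocab_dict = {term: i for i, term in enumerate(vocab_list)}

docs = ["python and machine learning", "java all day"]

# ngram_range=(1, 2) so the bigram "machine learning" can match
v_list = TfidfVectorizer(ngram_range=(1, 2), vocabulary=vocab_list)
v_dict = TfidfVectorizer(ngram_range=(1, 2), vocabulary=vocab_dict)

X_list = v_list.fit_transform(docs)
X_dict = v_dict.fit_transform(docs)

# Both matrices have one column per vocabulary term and identical entries
print(X_list.shape)            # (2, 3)
print((X_list != X_dict).nnz)  # 0 -> no differing entries
```

With a fixed vocabulary, the number of columns is determined by the vocabulary size, not by the corpus, which is why X above has 24321 columns.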

The shape of X is (651, 24321) as I have 651 instances to cluster and 24321 words in the vocabulary.

If I print the contents of X, this is what I get:

(14, 11462)     1.0
(20, 10218)     1.0
(34, 11462)     1.0
(40, 11462)     0.852815313278
(40, 10218)     0.52221264006
(50, 11462)     1.0
(81, 11462)     1.0
(84, 11462)     1.0
(85, 11462)     1.0
(99, 10218)     1.0
(127, 11462)    1.0
(129, 10218)    1.0
(132, 11462)    1.0
(136, 11462)    1.0
(138, 11462)    1.0
(150, 11462)    1.0
(158, 11462)    1.0
(186, 11462)    1.0
(210, 11462)    1.0
   :             :

As can be seen, for most of the instances only one word from the vocabulary is present (which is wrong, as there are at least 10), and for a lot of instances not even one word is found. Also, the words found tend to be the same across instances, which doesn't make sense.

If I print the feature names using:

import numpy as np

feature_names = np.asarray(vectorizer.get_feature_names())

I get:

['.NET' '10K' '21 CFR Part 11' ..., 'Zend Studio' 'Zendesk' 'Zenworks']

I must say that the program was running perfectly when the vocabulary used was the one determined from the input documents, so I strongly suspect that the problem is related to using a custom vocabulary.

Does anyone have a clue of what's happening?

(I'm not using a pipeline, so this problem can't be related to a previous bug which has already been fixed.)

Answer

I am pretty sure this is caused by the (arguably confusing) default value of min_df=2, which cuts any feature out of the vocabulary if it does not occur at least twice in the dataset. Can you please confirm by explicitly setting min_df=1 in your code?
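A minimal sketch of the suggested fix, with tiny stand-ins for the real dataset2 and the 24k-term vocabulary (note that in current scikit-learn versions min_df already defaults to 1 and document-frequency pruning is skipped when a fixed vocabulary is supplied, so passing it explicitly is harmless):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-ins for the real data (dataset2 and the 24,000-term vocabulary)
dataset2 = ["zend studio and zendesk tickets", "pure narrative text"]
myvocab = ["zend studio", "zendesk", "zenworks"]

vectorizer = TfidfVectorizer(max_df=0.8, stop_words='english',
                             ngram_range=(1, 2), vocabulary=myvocab,
                             min_df=1)  # keep terms even if they occur only once
X = vectorizer.fit_transform(dataset2)

# Every vocabulary term keeps its column, whether or not it matched
print(X.shape)  # (2, 3)
```

With min_df=1 no vocabulary term is discarded for being rare, so single-occurrence skills contribute to the tf-idf matrix instead of silently vanishing.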

