Find top n terms with highest TF-IDF score per class
Question
Suppose I have a pandas dataframe with two columns that resembles the following:
   text                              label
0  This restaurant was amazing       Positive
1  The food was served cold          Negative
2  The waiter was a bit rude         Negative
3  I love the view from its balcony  Positive
I then apply TfidfVectorizer from sklearn to this dataset.
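As a minimal sketch of that setup, assuming default TfidfVectorizer settings (the post does not say which options were used) and the illustrative variable names df, vectorizer, and X:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# The four example rows shown above
df = pd.DataFrame({
    "text": [
        "This restaurant was amazing",
        "The food was served cold",
        "The waiter was a bit rude",
        "I love the view from its balcony",
    ],
    "label": ["Positive", "Negative", "Negative", "Positive"],
})

# Default settings are an assumption made for this sketch
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["text"])  # sparse matrix: one row per document, one column per term

print(X.shape)
```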
What is the most efficient way to find the top n vocabulary terms by TF-IDF score for each class?
Obviously, my actual dataframe has many more rows than the 4 above.
The point of my post is to find code that works for any dataframe resembling the one above, whether it has 4 rows or 1M rows.
I think my post is closely related to the following posts:
- Scikit Learn TfidfVectorizer : How to get top n terms with highest tf-idf score
- How to see top n entries of term-document matrix after tfidf in scikit-learn
Recommended answer
Below is a piece of code I wrote more than three years ago for a similar purpose. I'm not sure it is the most efficient way to do what you want, but as far as I remember, it worked for me.
import operator

import numpy as np

# X: TF-IDF matrix of the data points (sparse, as returned by vectorizer.fit_transform)
# y: targets (the data points' labels), indexable by row position
# vectorizer: fitted TF-IDF vectorizer created by sklearn
# n: number of features that we want to list for each class
# target_list: the list of all unique labels (for example, in my case I have two labels: 1 and -1 and target_list = [1, -1])
# --------------------------------------------
# mapping each column index of X back to its term
index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}

# splitting X vectors based on target classes
for label in target_list:
    # finding the indices of the rows (data points) for the current class
    indices = [i for i in range(X.shape[0]) if y[i] == label]
    # getting the rows of the current class from the tf-idf matrix and calculating the mean of each feature
    vectors = np.asarray(X[indices, :].mean(axis=0)).ravel()
    # creating a dictionary of feature indices with their corresponding mean values
    current_dict = {i: vectors[i] for i in range(X.shape[1])}
    # sorting the dictionary based on values, highest first
    sorted_dict = sorted(current_dict.items(), key=operator.itemgetter(1), reverse=True)
    # printing the rank, the term and its mean value for the top n features
    for index, (feature_index, value) in enumerate(sorted_dict[:n], start=1):
        print(str(index) + "\t" + index_to_term[feature_index] + "\t" + str(value))