NLP in Python: Obtain word names from SelectKBest after vectorizing

Problem description

I can't seem to find an answer to my exact problem. Can anyone help?

A simplified description of my dataframe ("df"): it has 2 columns. One is a bunch of text ("Notes"), and the other is a binary variable indicating whether the resolution time was above average or not ("y").

I vectorized the text with a bag-of-words model:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
matrix = vectorizer.fit_transform(df["Notes"])

My matrix is 6290 x 4650. Getting the word names (i.e. feature names) is no problem:

feature_names = vectorizer.get_feature_names()
feature_names

Next, I want to know which of these 4650 words are most associated with above-average resolution times, and to reduce the matrix I may want to use in a predictive model. I run a chi-square test to find the top 20 most important words.

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
selector = SelectKBest(chi2, k=20)
selector.fit(matrix, y)
top_words = selector.get_support().nonzero()

# Pick only the most informative columns in the data.
chi_matrix = matrix[:,top_words[0]]

Now I'm stuck. How do I get the words from this reduced matrix ("chi_matrix")? What are my feature names? I was trying this:

chi_matrix.feature_names[selector.get_support(indices=True)].tolist()

chi_matrix.feature_names[features.get_support()]

These give me an error: feature_names not found. What am I missing?

Recommended answer

After figuring out what I really wanted to do (thanks Daniel) and doing more research, I found a couple of other ways to meet my objective.

Way 1 - https://glowingpython.blogspot.com/2014/02/terms-selection-with-chi-square.html

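from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

vectorizer = CountVectorizer(lowercase=True, stop_words='english')
X = vectorizer.fit_transform(df["Notes"])

# chi2 returns (scores, p-values); keep only the scores
chi2score = chi2(X, df['AboveAverage'])[0]

# Pair each word with its score and sort ascending by score
# (scikit-learn >= 1.0; older releases use get_feature_names() instead)
wscores = zip(vectorizer.get_feature_names_out(), chi2score)
wchi2 = sorted(wscores, key=lambda x: x[1])

# Take the 20 highest-scoring pairs and unzip them into (words, scores)
topchi2 = zip(*wchi2[-20:])
show = list(topchi2)
show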

Way 2 - This is the way I used, because it was the easiest for me to understand and produced a nice output listing the word, chi2 score, and p-value. Another thread on here: Sklearn Chi2 For Feature Selection

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

vectorizer = CountVectorizer(lowercase=True, stop_words='english')
X = vectorizer.fit_transform(df["Notes"])

y = df['AboveAverage']

# Select the 10 features with the highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=10)
chi2_selector.fit(X, y)

# Look at the scores the selector computed for every feature (not only the top 10)
# (scikit-learn >= 1.0; older releases use get_feature_names() instead)
chi2_scores = pd.DataFrame(
    list(zip(vectorizer.get_feature_names_out(), chi2_selector.scores_, chi2_selector.pvalues_)),
    columns=['ftr', 'score', 'pval'])
chi2_scores
