Text analysis: finding the most common word in a column using Python


Question

I have created a DataFrame that contains only the Subject column.

df = activities.filter(['Subject'],axis=1)
df.shape

This returned the following DataFrame:

    Subject
0   Call Out: Quadria Capital - May Lo, VP
1   Call Out: Revelstoke - Anthony Hayes (Sr Assoc...
2   Columbia Partners: WW Worked (Not Sure Will Ev...
3   Meeting, Sophie, CFO, CDC Investment
4   Prospecting

I then tried to analyse the text with this code:

import nltk
import pandas as pd
top_N = 50
txt = df.Subject.str.lower().str.replace(r'\|', ' ')
words = nltk.tokenize.word_tokenize(txt)
word_dist = nltk.FreqDist(words)

stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords) 

rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])
print(rslt)

The error message I get is: 'Series' object has no attribute 'Subject'

Recommended answer

The error is thrown because you converted df to a Series in this line:

df = activities.filter(['Subject'],axis=1)

So when you write:

txt = df.Subject.str.lower().str.replace(r'\|', ' ')

df is a Series and does not have the attribute Subject. Try replacing it with:

txt = df.str.lower().str.replace(r'\|', ' ')

Alternatively, don't filter your DataFrame down to a single Series first, in which case

txt = df.Subject.str.lower().str.replace(r'\|', ' ')

should work.

[Update]

What I said above is incorrect. As was pointed out, filter does not return a Series but a DataFrame with a single column.
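
To check the point in the update, here is a minimal sketch, assuming a small made-up `activities` frame (the `Owner` column and the sample rows are illustrative, not the original data). It shows that `filter` with a list of labels yields a single-column DataFrame, and then counts words with `collections.Counter` and a regular expression rather than nltk. The tiny `stop_words` set is a hypothetical stand-in for nltk's English stopword list.

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the `activities` frame in the question.
activities = pd.DataFrame({
    "Subject": [
        "Call Out: Quadria Capital - May Lo, VP",
        "Meeting, Sophie, CFO, CDC Investment",
        "Prospecting",
        "Call Out: Revelstoke",
    ],
    "Owner": ["a", "b", "c", "d"],  # extra column, assumed for illustration
})

# filter() with a list of labels keeps the result two-dimensional.
df = activities.filter(["Subject"], axis=1)
is_frame = isinstance(df, pd.DataFrame)           # a single-column DataFrame
is_series = isinstance(df["Subject"], pd.Series)  # selecting the column gives a Series

# Lowercase each subject, extract alphabetic tokens, and count them.
# The stop list is a small assumption, not nltk's stopword corpus.
stop_words = {"the", "a", "of", "and"}
counts = Counter(
    word
    for subject in df["Subject"].dropna()
    for word in re.findall(r"[a-z']+", subject.lower())
    if word not in stop_words
)

top_n = 3
rslt = pd.DataFrame(counts.most_common(top_n), columns=["Word", "Frequency"])
print(rslt)
```

Iterating over the column row by row hands each tokenizing call a plain string, so the same loop should also work with `nltk.tokenize.word_tokenize(subject)` in place of `re.findall` once the nltk tokenizer data is installed.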
