Count Words in Python
Question
I have a list of strings in Python.
documents = ["Sentence1.Sentence2...", "Sentence1.Sentence2...", ...]  # avoid calling it "list": that shadows the built-in
I want to remove stop words and count the occurrences of each word across all the different strings combined. Is there a simple way to do this?
I am currently thinking of using CountVectorizer() from scikit-learn and then iterating over each word and combining the results.
Answer
If you don't mind installing a new Python library, I suggest you use gensim. The first tutorial does exactly what you ask:
# remove common words and tokenize
# ("documents" is your list of strings from above)
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
You will then need to create the dictionary for your corpus of documents and create the bag-of-words.
from gensim import corpora

dictionary = corpora.Dictionary(texts)
dictionary.save('/tmp/deerwester.dict')  # store the dictionary for future use
print(dictionary)
You can weight the result using tf-idf and do LDA quite easily afterwards.
See tutorial 1 here.