How to get bag of words from textual data?

Question

I am working on a prediction problem using a large textual dataset, and I am implementing a bag-of-words model.

What is the best way to get the bag of words? Right now I have tf-idf values for the individual words, but the number of words is too large to use for further tasks. If I use a tf-idf criterion, what threshold should I use to select the bag of words? Or should I use some other algorithm? I am using Python.

Recommended answer

Use the collections.Counter class:

>>> import collections, re
>>> texts = ['John likes to watch movies. Mary likes too.',
...          'John also likes to watch football games.']
>>> bagsofwords = [collections.Counter(re.findall(r'\w+', txt))
...                for txt in texts]
>>> bagsofwords[0]
Counter({'likes': 2, 'watch': 1, 'Mary': 1, 'movies': 1, 'John': 1, 'to': 1, 'too': 1})
>>> bagsofwords[1]
Counter({'watch': 1, 'games': 1, 'to': 1, 'likes': 1, 'also': 1, 'John': 1, 'football': 1})
>>> sumbags = sum(bagsofwords, collections.Counter())
>>> sumbags
Counter({'likes': 3, 'watch': 2, 'John': 2, 'to': 2, 'games': 1, 'football': 1, 'Mary': 1, 'movies': 1, 'also': 1, 'too': 1})
>>> 
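
The question also mentions that the vocabulary is too large. One possible way to shrink it, sketched here as an assumption rather than as part of the original answer, is to keep only the words that appear in the most documents and turn each Counter into a fixed-length count vector. The helper names (tokenize, build_vocabulary, vectorize) and the cutoff max_words=4 are hypothetical and chosen only for illustration; a suitable cutoff depends on the dataset.

```python
import collections
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r'\w+', text.lower())


def build_vocabulary(bags, max_words):
    """Return the max_words terms that occur in the most documents."""
    doc_freq = collections.Counter()
    for bag in bags:
        doc_freq.update(bag.keys())  # count each word once per document
    # Sort by descending document frequency, then alphabetically so that
    # ties are broken deterministically.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:max_words]]


def vectorize(bag, vocabulary):
    """Map a Counter to a list of counts, one entry per vocabulary word."""
    return [bag[word] for word in vocabulary]  # missing words count as 0


texts = ['John likes to watch movies. Mary likes too.',
         'John also likes to watch football games.']

bags = [collections.Counter(tokenize(t)) for t in texts]
vocabulary = build_vocabulary(bags, max_words=4)  # hypothetical cutoff
vectors = [vectorize(bag, vocabulary) for bag in bags]

print(vocabulary)
print(vectors)
```

Ranking by document frequency is only one option; a tf-idf score or a minimum-count threshold could be substituted in build_vocabulary without changing the rest of the sketch.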
