How to get bag of words from textual data?

Views: 87 Published: 2020/5/4 9:26:25 Tags: python machine-learning text-processing

Question

I am working on a prediction problem using a large textual dataset, and I am implementing a bag-of-words model.

What is the best way to get the bag of words? Right now I have the tf-idf of the various words, but the number of words is too large to use in further tasks. If I use tf-idf as the criterion, what should the tf-idf threshold be for selecting the bag of words? Or should I use some other algorithm? I am using Python.

Recommended answer

Use the collections.Counter class:

>>> import collections, re
>>> texts = ['John likes to watch movies. Mary likes too.',
...          'John also likes to watch football games.']
>>> bagsofwords = [collections.Counter(re.findall(r'\w+', txt))
...                for txt in texts]
>>> bagsofwords[0]
Counter({'likes': 2, 'watch': 1, 'Mary': 1, 'movies': 1, 'John': 1, 'to': 1, 'too': 1})
>>> bagsofwords[1]
Counter({'watch': 1, 'games': 1, 'to': 1, 'likes': 1, 'also': 1, 'John': 1, 'football': 1})
>>> sumbags = sum(bagsofwords, collections.Counter())
>>> sumbags
Counter({'likes': 3, 'watch': 2, 'John': 2, 'to': 2, 'games': 1, 'football': 1, 'Mary': 1, 'movies': 1, 'also': 1, 'too': 1})
>>>
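
The question also worries that the vocabulary is too large. One way to shrink it with the same Counter objects is to keep only the most frequent words and turn each document into a fixed-length count vector. The following is a minimal sketch, not part of the original answer: the cut-off `top_k` and the names `total`, `vocabulary` and `vectors` are illustrative assumptions, and a plain frequency cut-off is a different criterion from the tf-idf threshold the question asks about.

```python
import collections
import re

texts = ['John likes to watch movies. Mary likes too.',
         'John also likes to watch football games.']

# One Counter per document, as in the session above.
bags = [collections.Counter(re.findall(r'\w+', txt)) for txt in texts]

# Corpus-wide counts; Counter.update adds counts in place.
total = collections.Counter()
for bag in bags:
    total.update(bag)

# Keep only the top_k most frequent words.
# top_k is an arbitrary value chosen for illustration.
top_k = 4
vocabulary = [word for word, _ in total.most_common(top_k)]

# Fixed-length count vector per document; a Counter returns 0
# for words that do not occur in that document.
vectors = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
print(vectors)
```

With the two sample sentences, the four retained words are exactly those that occur at least twice in the corpus (likes, John, to, watch). For a real dataset the cut-off would need to be tuned, for example by checking how the downstream model performs at several vocabulary sizes.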
