tf-idf on a somewhat large (65k) amount of text files


Problem Description


I want to try tf-idf with scikit-learn (or nltk; I am open to other suggestions). The data I have is a relatively large collection of discussion forum posts (~65k) that we have scraped and stored in MongoDB. Each post has a post title, the date and time of the post, the text of the post message (or a "re:" if it is a reply to an existing post), the user name, a message ID, and whether it is a child or parent post (in a thread you have the original post, then replies to that OP, or nested replies, forming a tree).

I figure each post would be a separate document, and, similar to the 20 newsgroups data, each document would have the fields I mentioned at the top and the text of the message post at the bottom, which I would extract out of mongo and write into the required format, one text file per post.
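A minimal sketch of what that export might look like, assuming pymongo and a collection whose documents carry message_id, title and text fields (the database, collection and field names here are all placeholders, not something from the original setup):

# Rough export sketch: one text file per post, named after its message ID.
# 'forum_db', 'posts', 'message_id', 'title' and 'text' are placeholder names.
import io
import os
from pymongo import MongoClient

posts = MongoClient()['forum_db']['posts']

if not os.path.isdir('posts_txt'):
    os.makedirs('posts_txt')

for doc in posts.find():
    path = os.path.join('posts_txt', '%s.txt' % doc['message_id'])
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(doc.get('title', '') + u'\n\n' + doc.get('text', ''))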

For loading the data into scikit, I know of:
http://scikit-learn.org/dev/modules/generated/sklearn.datasets.load_files.html (but my data is not categorized)
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html - for the input I know I would be using filenames, but because I would have a large number of files (one per post), is there a way to have the filenames read from a text file? Or is there an example implementation someone could point me towards?

Also, any advice on structuring the filenames for each of these discussion forum posts, for later identifying them when I get the tf-idf vectors and the cosine similarity array?

Thanks

Solution

You can pass a Python generator or a generator expression of either filenames or string objects instead of a list, and thus lazily load the data from the drive as you go. Here is a toy example of a CountVectorizer taking a generator expression as its argument:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> CountVectorizer().fit_transform(('a' * i for i in xrange(100)))
<100x98 sparse matrix of type '<type 'numpy.int64'>'
    with 98 stored elements in Compressed Sparse Column format>
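
The question also asks whether the filenames themselves can be read from a text file. One way to do that, sketched here with a hypothetical listing file called filelist.txt (one path per line), is a small generator combined with input='filename' so the vectorizer opens each file itself:

# Lazily yield one filename per non-empty line of the listing file.
from sklearn.feature_extraction.text import TfidfVectorizer

def iter_filenames(list_path):
    with open(list_path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield line

# input='filename' makes the vectorizer read each file from disk itself
vectorizer = TfidfVectorizer(input='filename')
tfidf = vectorizer.fit_transform(iter_filenames('filelist.txt'))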

Note that generator support makes it possible to vectorize the data directly from a MongoDB query result iterator rather than going through filenames.
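
For instance, something along these lines with pymongo (database, collection and field names are again placeholders):

# Stream the post text straight out of the query cursor; nothing is written
# to disk first. Sorting by _id fixes the row order so that row i of the
# resulting matrix can later be traced back to a specific post.
from pymongo import MongoClient
from sklearn.feature_extraction.text import TfidfVectorizer

posts = MongoClient()['forum_db']['posts']
cursor = posts.find({}, {'text': 1}).sort('_id', 1)

tfidf = TfidfVectorizer().fit_transform(doc.get('text', '') for doc in cursor)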

Also, a list of 65k filenames of 10 chars each is just 650 kB in memory (+ the overhead of the Python list), so it should not be a problem to load all the filenames ahead of time anyway.

any advice on structuring the filenames for each of these discussion forum posts, for later identifying them when I get the tf-idf vectors and cosine similarity array

Just use a deterministic ordering to be able to sort the list of filenames before feeding them to the vectorizer.
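
A sketch of that bookkeeping, assuming the posts were written out one file per post into a hypothetical posts_txt directory: sorting fixes the row order, so row i of the tf-idf matrix (and of any similarity scores computed from it) always corresponds to filenames[i]:

import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

filenames = sorted(glob.glob('posts_txt/*.txt'))   # deterministic order
tfidf = TfidfVectorizer(input='filename').fit_transform(filenames)

# Compare one post against all others; a full 65k x 65k dense similarity
# matrix would take tens of gigabytes, so compute it one row at a time.
sims = cosine_similarity(tfidf[0:1], tfidf).ravel()
for i in sims.argsort()[::-1][1:6]:                # 5 most similar posts
    print('%s %.3f' % (filenames[i], sims[i]))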
