Tfidfvectorizer - How can I check out processed tokens?


Problem description

How can I check the strings tokenized inside TfidfVectorizer()? If I don't pass anything in the arguments, TfidfVectorizer() will tokenize the strings with some pre-defined method. I want to observe how it tokenizes strings so that I can more easily tune my model.

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

I want something like this:

>>>vectorizer.get_processed_tokens()
[['this', 'is', 'first', 'document'],
 ['this', 'document', 'is', 'second', 'document'],
 ['this', 'is', 'the', 'third', 'one'],
 ['is', 'this', 'the', 'first', 'document']]

How can I do this?

Recommended answer

build_tokenizer() serves exactly this purpose.

Try this:

tokenizer = lambda docs: [vectorizer.build_tokenizer()(doc) for doc in docs]

tokenizer(corpus)

Output:

[['This', 'is', 'the', 'first', 'document'],
 ['This', 'document', 'is', 'the', 'second', 'document'],
 ['And', 'this', 'is', 'the', 'third', 'one'],
 ['Is', 'this', 'the', 'first', 'document']]

A one-liner solution would be:

list(map(vectorizer.build_tokenizer(), corpus))
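
Note that build_tokenizer() only applies the token pattern: the tokens above keep their original casing and nothing is filtered out. To see the tokens as the vectorizer actually processes them (lowercased, with stop words removed when configured), build_analyzer() returns that full pipeline as a callable. A minimal sketch, reusing the vectorizer fitted above:

# build_analyzer() bundles preprocessing (lowercasing), tokenization and
# stop-word filtering, so its output is closer to the "processed tokens"
# the question asks for.
analyzer = vectorizer.build_analyzer()
print([analyzer(doc) for doc in corpus])
# [['this', 'is', 'the', 'first', 'document'], ...]
# 'the' remains here because no stop_words are configured by default.

Passing stop_words='english' to TfidfVectorizer would additionally drop common words such as 'the' from this output.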

