Understanding min_df and max_df in scikit CountVectorizer
Question
I have five text files that I input to a CountVectorizer. When specifying min_df and max_df to the CountVectorizer instance, what exactly does the min/max document frequency mean? Is it the frequency of a word in its particular text file, or the frequency of the word in the entire corpus (all five text files)?
What are the differences when min_df and max_df are provided as integers or as floats?
The documentation doesn't seem to provide a thorough explanation, nor does it supply an example to demonstrate the use of these two parameters. Could someone provide an explanation or example demonstrating min_df and max_df?
Recommended answer
max_df is used for removing terms that appear too frequently, also known as "corpus-specific stop words". For example:
max_df = 0.50 (a float, read as a proportion of documents) means "ignore terms that appear in more than 50% of the documents".
max_df = 25 (an integer, read as an absolute number of documents) means "ignore terms that appear in more than 25 documents".
The default max_df is 1.0, which means "ignore terms that appear in more than 100% of the documents". Thus, the default setting does not ignore any terms.
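As a rough illustration of max_df, here is a minimal sketch. The five-document toy corpus and the helper kept_terms are made up for this example and are not from the question. Note that the document frequency counts how many documents contain a term, not how many times the term occurs overall.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus of five documents (an assumption for illustration).
# Document frequencies: apple=5, banana=3, cherry=2, date=1, elderberry=1.
docs = [
    "apple banana cherry",
    "apple banana",
    "apple cherry",
    "apple date",
    "apple banana elderberry",
]

def kept_terms(**kwargs):
    """Fit a CountVectorizer on docs and return the vocabulary it kept, sorted."""
    vec = CountVectorizer(**kwargs)
    vec.fit(docs)
    return sorted(vec.vocabulary_)

# Default max_df=1.0: nothing is dropped.
default_terms = kept_terms()
# Float: drop terms found in more than 50% of the 5 documents (more than 2.5).
half_terms = kept_terms(max_df=0.5)
# Integer: drop terms found in more than 3 documents.
three_terms = kept_terms(max_df=3)

print(default_terms)  # ['apple', 'banana', 'cherry', 'date', 'elderberry']
print(half_terms)     # ['cherry', 'date', 'elderberry']
print(three_terms)    # ['banana', 'cherry', 'date', 'elderberry']
```

With max_df=0.5 both "apple" (5 documents) and "banana" (3 documents) are dropped, while max_df=3 drops only "apple".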
min_df is used for removing terms that appear too infrequently. For example:
min_df = 0.01 (a float, read as a proportion of documents) means "ignore terms that appear in less than 1% of the documents".
min_df = 5 (an integer, read as an absolute number of documents) means "ignore terms that appear in fewer than 5 documents".
The default min_df is 1, which means "ignore terms that appear in fewer than 1 document". Thus, the default setting does not ignore any terms.
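A similar sketch for min_df, again on a made-up five-document corpus (the corpus and the helper kept_terms are assumptions for illustration). Here "date" occurs four times, but only inside a single document, so its document frequency is 1. This suggests the answer to the original question: the threshold is applied to the number of documents in the corpus that contain the term, not to how often the term occurs within any one file.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus of five documents (an assumption for illustration).
# Document frequencies: apple=5, banana=3, cherry=2, date=1, elderberry=1.
docs = [
    "apple banana cherry",
    "apple banana",
    "apple cherry",
    "apple date date date date",  # "date" occurs 4 times, but in 1 document
    "apple banana elderberry",
]

def kept_terms(**kwargs):
    """Fit a CountVectorizer on docs and return the vocabulary it kept, sorted."""
    vec = CountVectorizer(**kwargs)
    vec.fit(docs)
    return sorted(vec.vocabulary_)

# Default min_df=1: every term appears in at least one document, so all stay.
default_terms = kept_terms()
# Integer: drop terms found in fewer than 2 documents.
two_terms = kept_terms(min_df=2)
# Float: drop terms found in less than 50% of the 5 documents (fewer than 2.5).
half_terms = kept_terms(min_df=0.5)

print(default_terms)  # ['apple', 'banana', 'cherry', 'date', 'elderberry']
print(two_terms)      # ['apple', 'banana', 'cherry']
print(half_terms)     # ['apple', 'banana']
```

"date" is dropped by min_df=2 despite its four occurrences, because all of them fall in one document.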