Understanding min_df and max_df in scikit CountVectorizer
Question
I have five text files that I input to a CountVectorizer. When specifying min_df and max_df for the CountVectorizer instance, what exactly does the min/max document frequency mean? Is it the frequency of a word in its particular text file, or is it the frequency of the word in the entire corpus (5 txt files)?
How does the behavior differ when min_df and max_df are provided as integers versus floats?
The documentation doesn't seem to provide a thorough explanation, nor does it supply an example demonstrating the use of min_df and/or max_df. Could someone provide an explanation or an example demonstrating min_df or max_df?
Recommended answer
max_df is used for removing terms that appear too frequently, also known as "corpus-specific stop words". Document frequency here means the number of documents in the corpus that contain the term, not how often the term occurs within any one file. A float is interpreted as a proportion of the documents, and an integer as an absolute number of documents. For example:
- max_df = 0.50 means "ignore terms that appear in more than 50% of the documents".
- max_df = 25 means "ignore terms that appear in more than 25 documents".
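As a rough illustration of the two forms of max_df, the sketch below fits CountVectorizer on a small made-up corpus. The five strings in docs are hypothetical stand-ins for the five text files in the question, and the thresholds are chosen only so that the float and integer cases give different vocabularies.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: five short strings standing in for the five text files.
docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
    "the fish swam",
    "the cat flew",
]

# Document frequency = number of documents that contain the term:
# the: 5, cat: 2, sat: 2, flew: 2, dog: 1, bird: 1, fish: 1, swam: 1

# Float: ignore terms found in more than 50% of the documents (more than 2.5 of 5).
vec_float = CountVectorizer(max_df=0.5)
vec_float.fit(docs)
vocab_float = sorted(vec_float.vocabulary_)
print(vocab_float)

# Integer: ignore terms found in more than 1 document.
vec_int = CountVectorizer(max_df=1)
vec_int.fit(docs)
vocab_int = sorted(vec_int.vocabulary_)
print(vocab_int)
```

Note that max_df=1 (an integer, "at most one document") and max_df=1.0 (a float, "at most 100% of the documents") are not the same setting.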
The default max_df is 1.0, which means "ignore terms that appear in more than 100% of the documents". Thus, the default setting does not ignore any terms.
min_df is used for removing terms that appear too infrequently. For example:
- min_df = 0.01 means "ignore terms that appear in fewer than 1% of the documents".
- min_df = 5 means "ignore terms that appear in fewer than 5 documents".
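The min_df cases can be tried the same way. This is a minimal sketch on a hypothetical five-document corpus (the strings are invented for illustration), with thresholds picked so that the integer and float settings keep different terms.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: five short strings standing in for the five text files.
docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
    "the fish swam",
    "the cat flew",
]

# Document frequencies: the: 5, cat: 2, sat: 2, flew: 2, all other terms: 1

# Integer: ignore terms found in fewer than 2 documents.
vec_int = CountVectorizer(min_df=2)
vec_int.fit(docs)
kept_int = sorted(vec_int.vocabulary_)
print(kept_int)

# Float: ignore terms found in fewer than 50% of the documents (fewer than 2.5 of 5).
vec_float = CountVectorizer(min_df=0.5)
vec_float.fit(docs)
kept_float = sorted(vec_float.vocabulary_)
print(kept_float)
```

With only five documents, a small proportion such as min_df = 0.01 corresponds to a fraction of one document, so it would not remove anything; proportions become more useful on larger corpora.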
The default min_df is 1, which means "ignore terms that appear in fewer than 1 document". Thus, the default setting does not ignore any terms.
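To see why the defaults keep every term, and how integers and floats are treated differently, the thresholding can be sketched in plain Python. This is an illustration of the rule described in this answer, not scikit-learn's actual code; the function name filter_by_document_frequency and the pre-tokenized toy corpus are assumptions made for the example.

```python
from collections import Counter
from numbers import Integral


def filter_by_document_frequency(tokenized_docs, min_df=1, max_df=1.0):
    """Return the sorted terms whose document frequency lies within the bounds."""
    n_docs = len(tokenized_docs)
    # Count each term once per document, however often it repeats inside it.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # Integers are absolute document counts; floats are proportions of the corpus.
    low = min_df if isinstance(min_df, Integral) else min_df * n_docs
    high = max_df if isinstance(max_df, Integral) else max_df * n_docs
    return sorted(term for term, count in df.items() if low <= count <= high)


# Hypothetical, already-tokenized corpus of four documents.
docs = [
    ["the", "cat", "sat", "the"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
    ["the", "fish", "swam"],
]

# Defaults: at least 1 document, at most 100% of documents, so nothing is dropped.
default = filter_by_document_frequency(docs)
print(default)

# At least 2 documents and at most 75% of them (3 of 4): only "sat" qualifies.
trimmed = filter_by_document_frequency(docs, min_df=2, max_df=0.75)
print(trimmed)
```

In the first document "the" occurs twice but still counts as one document, which is the difference between document frequency and raw term frequency.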