SMOTE initialisation expects n_neighbors <= n_samples, but n_samples < n_neighbors
Problem description
I have already pre-cleaned the data; the first five rows have the following format:
[IN] df.head()
[OUT] Year cleaned
0 1909 acquaint hous receiv follow letter clerk crown...
1 1909 ask secretari state war whether issu statement...
2 1909 i beg present petit sign upward motor car driv...
3 1909 i desir ask secretari state war second lieuten...
4 1909 ask secretari state war whether would introduc...
I have called train_test_split() as follows:
[IN] X_train, X_test, y_train, y_test = train_test_split(df['cleaned'], df['Year'], random_state=2)
[Note*] `X_train` and `y_train` are now pandas.core.series.Series of shape (1785,), and `X_test` and `y_test` are pandas.core.series.Series of shape (595,)
I have then vectorized the X training and testing data using the following TfidfVectorizer and fit/transform procedures:
[IN] v = TfidfVectorizer(decode_error='replace', encoding='utf-8', stop_words='english', ngram_range=(1, 1), sublinear_tf=True)
X_train = v.fit_transform(X_train)
X_test = v.transform(X_test)
I'm now at the stage where I would normally apply a classifier, etc. (if this were a balanced set of data). However, I initialize imblearn's SMOTE() class (to perform over-sampling)...
[IN] smote_pipeline = make_pipeline_imb(SMOTE(), classifier(random_state=42))
smote_model = smote_pipeline.fit(X_train, y_train)
smote_prediction = smote_model.predict(X_test)
...but this results in:
[OUT] ValueError: Expected n_neighbors <= n_samples, but n_samples = 5, n_neighbors = 6
I've tried reducing n_neighbors, but to no avail. Any tips or advice would be much appreciated. Thanks for reading.
------------------------------------------------------------------------------------------------------------------------------------
The dataset/dataframe (df) contains 2380 rows across two columns, as shown in df.head() above. X_train contains 1785 of these rows in the format of a list of strings (df['cleaned']), and y_train also contains 1785 rows in the format of strings (df['Year']).
Post-vectorization using TfidfVectorizer(): X_train and X_test are converted from pandas.core.series.Series of shape (1785,) and (595,) respectively, to scipy.sparse.csr.csr_matrix of shape (1785, 126459) and (595, 126459) respectively.
As for the number of classes: using Counter(), I've calculated that there are 199 classes (years). Each instance of a class is attached to one element of the aforementioned df['cleaned'] data, which contains a list of strings extracted from a textual corpus.
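The class count mentioned above can be checked with a few lines of standard-library code. This is only a minimal sketch: `summarize_classes` and the toy label list are hypothetical stand-ins for the real `y_train`, which is not reproduced here.

```python
from collections import Counter


def summarize_classes(labels, n_smallest=2):
    """Return the number of distinct classes and the rarest ones."""
    counts = Counter(labels)
    # most_common() sorts by descending count, so the tail holds the rarest classes.
    rarest = counts.most_common()[-n_smallest:]
    return len(counts), rarest


# Toy labels standing in for y_train (assumption: labels are years).
toy_years = [1909] * 5 + [1910] * 2 + [1911] * 1
n_classes, rarest = summarize_classes(toy_years)
print(n_classes)  # number of distinct years
print(rarest)     # (year, count) pairs for the smallest classes
```

Looking at the rarest classes shows directly whether any class is smaller than the neighbourhood size SMOTE asks for.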
The objective of this process is to automatically determine/guess the year, decade or century (any degree of classification will do!) of input textual data based on the vocabulary present.
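Since a coarser target such as the decade or century would also be acceptable, the year labels could be binned before training, which leaves fewer classes with more samples each. The sketch below is an illustration only; `to_decade`, `to_century` and the toy list are hypothetical names and data, not part of the original post.

```python
from collections import Counter


def to_decade(year):
    """Map a year to the first year of its decade, e.g. 1909 -> 1900."""
    return (int(year) // 10) * 10


def to_century(year):
    """Map a year to the first year of its hundred-year block, e.g. 1987 -> 1900."""
    return (int(year) // 100) * 100


# Toy labels standing in for df['Year'].
toy_years = [1909, 1909, 1912, 1915, 1921]
decades = [to_decade(y) for y in toy_years]
print(Counter(decades))  # fewer, larger classes than the raw years
```

With pandas, the same mapping could be applied to the label column with `df['Year'].map(to_decade)` before calling train_test_split().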
Recommended answer
Since there are approximately 200 classes and 1800 samples in the training set, you have on average 9 samples per class. The error arises because a) the data are probably not perfectly balanced, so some classes have fewer than 6 samples, and b) the number of neighbors is 6. A few solutions for your problem:
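Calculate the minimum number of samples (n_samples) among the 199 classes and select a value for the n_neighbors parameter of the SMOTE class that is less than or equal to n_samples.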
Exclude the classes with n_samples < n_neighbors from oversampling using the ratio parameter of the SMOTE class.
Use the RandomOverSampler class, which does not have a similar restriction.
Combine the previous two solutions: create a pipeline that uses SMOTE and RandomOverSampler so that the condition n_neighbors <= n_samples is satisfied for the classes oversampled with SMOTE, and random oversampling is used for the classes where the condition is not satisfied.
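The solutions above could be sketched roughly as follows. This is a hedged illustration, not the answerer's code. It assumes a recent imbalanced-learn release, where the neighbourhood size is set through `k_neighbors` (the internal search uses `k_neighbors + 1`, which matches the 6 in the error message with the default of 5) and per-class targets are passed through `sampling_strategy` rather than the older `ratio`. The helper names are hypothetical.

```python
from collections import Counter


def max_k_neighbors(y):
    """Largest k_neighbors that every class can support (smallest class size - 1).

    A result of 0 means at least one class has a single sample, so SMOTE
    cannot be applied to every class no matter how small k_neighbors is.
    """
    return min(Counter(y).values()) - 1


def build_strategies(y, k_neighbors=5):
    """Build per-class target counts for a two-stage oversampling pipeline.

    Classes with more than k_neighbors samples go to SMOTE; smaller ones
    go to random oversampling. Every class is raised to the largest class size.
    """
    counts = Counter(y)
    target = max(counts.values())
    smote_strategy = {c: target for c, n in counts.items() if n > k_neighbors}
    ros_strategy = {c: target for c, n in counts.items() if n <= k_neighbors}
    return ros_strategy, smote_strategy


def make_oversampling_pipeline(y, classifier, k_neighbors=5):
    """Random oversampling for tiny classes, then SMOTE for the rest."""
    # Imported lazily so the helpers above run without imbalanced-learn installed.
    from imblearn.over_sampling import SMOTE, RandomOverSampler
    from imblearn.pipeline import make_pipeline

    ros_strategy, smote_strategy = build_strategies(y, k_neighbors)
    return make_pipeline(
        RandomOverSampler(sampling_strategy=ros_strategy, random_state=42),
        SMOTE(sampling_strategy=smote_strategy, k_neighbors=k_neighbors,
              random_state=42),
        classifier,
    )
```

A call such as `make_oversampling_pipeline(y_train, some_classifier).fit(X_train, y_train)` would then replace the single-step SMOTE pipeline from the question. Alternatively, `max_k_neighbors(y_train)` gives the value to pass as `k_neighbors` if SMOTE alone is preferred and no class has only one sample.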