SMOTE initialisation expects n_neighbors <= n_samples, but n_samples < n_neighbors

Views: 60 Published: 2021/12/25 14:43:30 Tags: scikit-learn knn tf-idf oversampling imblearn

Problem description

I have already pre-cleaned the data. The first five rows look like this:

     [IN] df.head()

    [OUT]   Year    cleaned
         0  1909    acquaint hous receiv follow letter clerk crown...
         1  1909    ask secretari state war whether issu statement...
         2  1909    i beg present petit sign upward motor car driv...
         3  1909    i desir ask secretari state war second lieuten...
         4  1909    ask secretari state war whether would introduc...

I called train_test_split() as follows:

     [IN] X_train, X_test, y_train, y_test = train_test_split(df['cleaned'], df['Year'], random_state=2)
   [Note*] `X_train` and `y_train` are now pandas.core.series.Series of shape (1785,), and `X_test` and `y_test` are pandas.core.series.Series of shape (595,)
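
The shapes in the note can be checked by hand. This is a small sketch, assuming the 2380-row total reported for df in the question and scikit-learn's default test fraction of 0.25, which applies when neither test_size nor train_size is passed (as in the call above):

```python
import math

n_samples = 2380   # rows in df, as stated in the question
test_size = 0.25   # scikit-learn's default when no split size is given

# The test count is rounded up; the training set gets the remainder.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Both numbers match the shapes reported for X_train/y_train and X_test/y_test.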

I then vectorized the training and testing X data with the following TfidfVectorizer, fitting on the training set and transforming both sets:

     [IN] v = TfidfVectorizer(decode_error='replace', encoding='utf-8', stop_words='english', ngram_range=(1, 1), sublinear_tf=True)
          X_train = v.fit_transform(X_train)
          X_test = v.transform(X_test)

I'm now at the stage where I would normally apply a classifier (if this were a balanced set of data). However, when I initialize imblearn's SMOTE() class to perform over-sampling...

     [IN] smote_pipeline = make_pipeline_imb(SMOTE(), classifier(random_state=42))
          smote_model = smote_pipeline.fit(X_train, y_train)
          smote_prediction = smote_model.predict(X_test)

...it results in:

     [OUT] ValueError: Expected n_neighbors <= n_samples, but n_samples = 5, n_neighbors = 6

I've tried reducing the value of n_neighbors, but to no avail. Any tips or advice would be much appreciated. Thanks for reading.
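
One possible reading of the error, offered as a sketch rather than a confirmed diagnosis: SMOTE looks for neighbours inside each minority class, and with its default `k_neighbors=5` it requests 6 neighbours (the sample itself plus 5), so a class with only 5 training samples would fail exactly as shown. The helper name `max_safe_k_neighbors` is hypothetical and the toy labels are made up for illustration. It only counts samples per class and reports the largest `k_neighbors` the smallest class could support.

```python
from collections import Counter


def max_safe_k_neighbors(labels):
    """Return (largest usable k_neighbors, size of the smallest class).

    Assumes SMOTE needs at least k_neighbors + 1 samples in every class
    it oversamples, so the limit is the smallest class size minus one.
    """
    counts = Counter(labels)
    smallest = min(counts.values())
    return smallest - 1, smallest


# Toy labels (made up): 1909 has 8 samples, 1910 has 5, 1911 has 2.
toy_y_train = [1909] * 8 + [1910] * 5 + [1911] * 2

k, smallest = max_safe_k_neighbors(toy_y_train)
print(f"smallest class has {smallest} samples, so k_neighbors must be <= {k}")
```

With the real data this would be called on `y_train`, and the result could be passed as `SMOTE(k_neighbors=k)`. A class with a single sample gives `k = 0`, which SMOTE cannot use, so such classes would have to be merged or dropped first.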

------------------------------------------------------------------------------------------------------------------------------------

Full traceback

The dataset/dataframe (df) contains 2380 rows across two columns, as shown in df.head() above. X_train contains 1785 of these rows as strings (from df['cleaned']), and y_train contains the corresponding 1785 labels, also as strings (from df['Year']).

After vectorization with TfidfVectorizer(), X_train and X_test are converted from pandas.core.series.Series of shape '(1785,)' and '(595,)' respectively to scipy.sparse.csr.csr_matrix of shape '(1785, 126459)' and '(595, 126459)' respectively.

As for the number of classes: using Counter(), I've calculated that there are 199 classes (Years). Each class instance is attached to one element of the aforementioned df['cleaned'] data, which contains strings extracted from a textual corpus.

The objective of this process is to automatically determine/guess the year, decade or century (any degree of classification will do!) of input textual data based on the vocabulary present.
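
Since any granularity is acceptable, one option is to coarsen the labels before over-sampling, so that there are fewer and larger classes. The sketch below uses made-up toy labels and a hypothetical helper `to_decade`; it is an illustration of the idea, not a tested fix for this dataset. It accepts years as strings because the question states that the labels are stored as strings.

```python
from collections import Counter


def to_decade(year):
    """Map a year (int or numeric string) to the first year of its decade."""
    return (int(year) // 10) * 10


# Toy labels (made up); the real labels would come from df['Year'].
toy_years = ["1909", "1909", "1910", "1911", "1915", "1919", "1923"]

decades = [to_decade(y) for y in toy_years]
year_counts = Counter(toy_years)
decade_counts = Counter(decades)

print(f"{len(year_counts)} year classes, smallest has {min(year_counts.values())}")
print(f"{len(decade_counts)} decade classes, largest has {max(decade_counts.values())}")
```

Grouping by decade does not guarantee that every class ends up large enough for SMOTE, so the per-class counts are still worth checking afterwards.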
