Generate larger synthetic dataset based on a smaller dataset in Python

Views: 61  Published: 2020/5/4 9:41:53  Tags: python machine-learning scikit-learn imputation

Problem description

I have a dataset with 21000 rows (data samples) and 102 columns (features). I would like to generate a larger synthetic dataset based on it, say with 100000 rows, so that I can use it for machine learning.

I've been following the answer by @Prashant on this post, https://stats.stackexchange.com/questions/215938/generate-synthetic-data-to-match-sample-data, but I am unable to get it to generate a larger synthetic dataset for my data.

import numpy as np
from random import randrange, choice
from sklearn.neighbors import NearestNeighbors
import pandas as pd
#referring to https://stats.stackexchange.com/questions/215938/generate-synthetic-data-to-match-sample-data


df = pd.read_pickle('df_saved.pkl')
df = df.iloc[:,:-1] # this gives me df, the final Dataframe which I would like to generate a larger dataset based on. This is the smaller Dataframe with 21000x102 dimensions.


def SMOTE(T, N, k):
# """
# Returns (N/100) * n_minority_samples synthetic minority samples.
#
# Parameters
# ----------
# T : array-like, shape = [n_minority_samples, n_features]
#     Holds the minority samples
# N : percentage of new synthetic samples:
#     n_synthetic_samples = N/100 * n_minority_samples. Can be < 100.
# k : int. Number of nearest neighbours.
#
# Returns
# -------
# S : array, shape = [(N/100) * n_minority_samples, n_features]
# """
    n_minority_samples, n_features = T.shape

    if N < 100:
       #create synthetic samples only for a subset of T.
       #TODO: select random minority samples
       N = 100
       pass

    if (N % 100) != 0:
       raise ValueError("N must be < 100 or multiple of 100")

    N = N/100
    n_synthetic_samples = N * n_minority_samples
    n_synthetic_samples = int(n_synthetic_samples)
    n_features = int(n_features)
    S = np.zeros(shape=(n_synthetic_samples, n_features))

    #Learn nearest neighbours
    neigh = NearestNeighbors(n_neighbors = k)
    neigh.fit(T)

    #Calculate synthetic samples
    for i in range(n_minority_samples):
       nn = neigh.kneighbors(T[i], return_distance=False)
       for n in range(N):
          nn_index = choice(nn[0])
          #NOTE: nn includes T[i], we don't want to select it
          while nn_index == i:
             nn_index = choice(nn[0])

          dif = T[nn_index] - T[i]
          gap = np.random.random()
          S[n + i * N, :] = T[i,:] + gap * dif[:]

    return S

df = df.to_numpy()
new_data = SMOTE(df,50,10) # this is where I call the function and expect new_data to be generated with larger number of samples than original df.

The traceback of the error I get is shown below:

Traceback (most recent call last):
  File "MyScript.py", line 66, in <module>
    new_data = SMOTE(df,50,10)
  File "MyScript.py", line 52, in SMOTE
    nn = neigh.kneighbors(T[i], return_distance=False)
  File "/trinity/clustervision/CentOS/7/apps/anaconda/4.3.31/3.6-VE/lib/python3.5/site-packages/sklearn/neighbors/base.py", line 393, in kneighbors
    X = check_array(X, accept_sparse='csr')
  File "/trinity/clustervision/CentOS/7/apps/anaconda/4.3.31/3.6-VE/lib/python3.5/site-packages/sklearn/utils/validation.py", line 547, in check_array
    "if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
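
The shape problem behind this ValueError can be seen in isolation with NumPy alone. The snippet below is a minimal sketch, assuming a tiny stand-in array (the hypothetical `T` here is 4 x 3, not the real 21000 x 102 data). Indexing a 2D array with a single integer drops the row axis, whereas the input check in the traceback expects a 2D array even for a single sample.

```python
import numpy as np

# Hypothetical stand-in for the real 21000 x 102 array.
T = np.arange(12, dtype=float).reshape(4, 3)

row_1d = T[0]                 # integer indexing drops the row axis
row_2d = T[0].reshape(1, -1)  # one sample, kept as a 2D array
row_2d_alt = T[[0]]           # same result: a list index keeps the row axis

print(row_1d.shape)      # (3,)
print(row_2d.shape)      # (1, 3)
print(row_2d_alt.shape)  # (1, 3)
```

Either of the 2D forms should be acceptable as a single-sample query, since both have the layout (n_samples, n_features).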

I know that this error (Expected 2D array, got 1D array) occurs on the line nn = neigh.kneighbors(T[i], return_distance=False). To be precise, when I call the function, T is a numpy array of shape (21000, 102): my data, converted from a Pandas DataFrame to a numpy array. I know that this question may have similar duplicates, but none of them answers my question. Any help in this regard would be highly appreciated.
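
One possible way to adapt the function is sketched below. It is not the original answer's code, and the name `smote_oversample` and its `seed` parameter are hypothetical. It makes three changes to the posted function: all rows are queried at once as a 2D array, `N // 100` is used so that the per-row count stays an integer (in Python 3, `N/100` is a float and cannot be passed to `range`), and `k + 1` neighbours are requested so that `k` candidates remain after a row is dropped from its own neighbour list. Unlike the posted code, this sketch rejects `N < 100` instead of silently resetting it to 100. The demo uses random stand-in data rather than the real DataFrame.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_oversample(T, N, k, seed=None):
    """Return (N // 100) * len(T) synthetic rows interpolated between neighbours.

    T : 2D array of shape (n_samples, n_features)
    N : percentage of synthetic rows to create; a positive multiple of 100
    k : number of nearest neighbours to interpolate towards
    """
    T = np.asarray(T, dtype=float)
    if N <= 0 or N % 100 != 0:
        raise ValueError("N must be a positive multiple of 100")
    n_samples, n_features = T.shape
    per_row = N // 100  # integer, so it works with range() and as an index
    rng = np.random.default_rng(seed)

    # Ask for k + 1 neighbours because a row is normally its own nearest neighbour.
    neigh = NearestNeighbors(n_neighbors=k + 1).fit(T)
    # Query every row in one call; a 2D query avoids the 1D-array error.
    nn = neigh.kneighbors(T, return_distance=False)

    S = np.zeros((per_row * n_samples, n_features))
    for i in range(n_samples):
        candidates = nn[i][nn[i] != i]  # drop the row itself if present
        for n in range(per_row):
            j = rng.choice(candidates)  # pick one neighbour at random
            gap = rng.random()          # interpolation factor in [0, 1)
            S[n + i * per_row] = T[i] + gap * (T[j] - T[i])
    return S


# Demo on random stand-in data (200 rows, 5 features).
small = np.random.default_rng(0).normal(size=(200, 5))
synthetic = smote_oversample(small, N=400, k=10, seed=0)
larger = np.vstack([small, synthetic])
print(larger.shape)  # (1000, 5)
```

With 21000 original rows, `N=400` would give 4 * 21000 = 84000 synthetic rows. Stacked with the originals, that makes 105000 rows, slightly above the 100000 target, so the surplus could be subsampled if an exact count matters.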
