Python - Pandas, Resample dataset to have balanced classes

Question

Given the following data frame, which has only 2 possible labels:

   name  f1  f2  label
0     A   8   9      1
1     A   5   3      1
2     B   8   9      0
3     C   9   2      0
4     C   8   1      0
5     C   9   1      0
6     D   2   1      0
7     D   9   7      0
8     D   3   1      0
9     E   5   1      1
10    E   3   6      1
11    E   7   1      1

I've written code that groups the data by the 'name' column and pivots the result into a numpy array, so that each row is the collection of all samples of one group (zero-padded to the size of the largest group), and the labels are a separate numpy array:

Data:

[[8 9] [5 3] [0 0]] # A label = 1
[[8 9] [0 0] [0 0]] # B label = 0
[[9 2] [8 1] [9 1]] # C label = 0
[[2 1] [9 7] [3 1]] # D label = 0
[[5 1] [3 6] [7 1]] # E label = 1

Labels:

[[1]
 [0]
 [0]
 [0]
 [1]]

Code:

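import pandas as pd
import numpy as np


def prepare_data(group_name):
    df = pd.read_csv("../data/tmp.csv")

    # Number the rows within each group (0, 1, 2, ...), then pad every
    # group with all-zero rows up to the size of the largest group.
    group_index = df.groupby(group_name).cumcount()
    data = (df.set_index([group_name, group_index])
            .unstack(fill_value=0).stack())

    # One label per group, taken from the group's first row.
    target = np.array(data['label'].groupby(level=0).apply(lambda x: [x.values[0]]).tolist())
    # Drop the label column and collect each group's feature rows.
    data = data.loc[:, data.columns != 'label']
    data = np.array(data.groupby(level=0).apply(lambda x: x.values.tolist()).tolist())
    print(data)
    print(target)


prepare_data('name')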

I would like to resample the data by deleting whole groups from the over-represented class. For example:

[[8 9] [5 3] [0 0]] # A label = 1
[[8 9] [0 0] [0 0]] # B label = 0
[[9 2] [8 1] [9 1]] # C label = 0
# group D was deleted randomly from the '0' labels
[[5 1] [3 6] [7 1]] # E label = 1

would be an acceptable solution, since removing D (labeled '0') results in a balanced dataset with 2 groups labeled '1' and 2 groups labeled '0'.

Recommended answer

Provided that each name carries exactly one label (e.g. all rows of A are labeled 1), you can use the following approach:

  1. Group the names by label and check which label has an excess (in terms of unique names).
  2. Randomly remove names from the over-represented label class in order to account for the excess.
  3. Select the part of the data frame which does not contain the removed names.

The code is as follows:

labels = df.groupby('label').name.unique()
# Sort the over-represented class to the head.
labels = labels[labels.apply(len).sort_values(ascending=False).index]
excess = len(labels.iloc[0]) - len(labels.iloc[1])
remove = np.random.choice(labels.iloc[0], excess, replace=False)
df2 = df[~df.name.isin(remove)]
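
As a self-contained sketch of the same idea, the snippet below rebuilds the sample frame from the question inline instead of reading `../data/tmp.csv` and wraps the three steps in a function. The function name `balance_by_group` and its `seed` parameter are hypothetical, introduced only for illustration, and `np.random.default_rng` is swapped in for `np.random.choice` so the random draw can be made reproducible. Like the answer, it assumes exactly two labels and one label per name.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name":  list("AABCCCDDDEEE"),
    "f1":    [8, 5, 8, 9, 8, 9, 2, 9, 3, 5, 3, 7],
    "f2":    [9, 3, 9, 2, 1, 1, 1, 7, 1, 1, 6, 1],
    "label": [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})


def balance_by_group(df, group_col="name", label_col="label", seed=None):
    """Drop whole groups from the larger class until both classes
    contain the same number of groups (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Step 1: unique group names per label, and how many each label has.
    names = df.groupby(label_col)[group_col].unique()
    sizes = names.apply(len).sort_values(ascending=False)
    major, minor = sizes.index[0], sizes.index[1]
    excess = int(sizes.loc[major] - sizes.loc[minor])
    # Step 2: pick `excess` group names at random from the larger class.
    remove = rng.choice(names.loc[major], size=excess, replace=False)
    # Step 3: keep only the rows whose group was not picked.
    return df[~df[group_col].isin(remove)]


balanced = balance_by_group(df, seed=0)
# Number of distinct groups left per label.
print(balanced.groupby("label")["name"].nunique())
```

Which of B, C or D gets dropped depends on the seed; the result can then be passed through the same pivoting step as in the question.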
