How to preserve IDs after data balancing techniques like ROSE and SMOTE

Question

df1 = data.frame(id=c('A1','2','B3','4','5','6','7','8','9','10'),s1c1=c(0,0.2,0,0.5,0.8,0,0,0,0,0),s1c2=c(0,0,0.3,0,0,0.9,0.3,0,0,0),s1c3=c(0.1,0,0,0,0,0,0,0.2,0.8,0.1))
df2 = data.frame(id=c('A1','2','B3','4','5','6','7','8','9','10'),s2c1=c(0,0.22,0,0.35,0.8,0,0,0,0,0),s2c2=c(0,0,0.23,0,0,0.7,0.3,0,0,0),s2c3=c(0.2,0,0,0,0,0,0,0.4,0.9,0.4))
df <- merge(df1,df2, by="id",all=TRUE)
df$class <- c(0,0,0,0,0,1,1,0,0,0) 
> df
  id s1c1 s1c2 s1c3 s2c1 s2c2 s2c3 class
  10  0.0  0.0  0.1 0.00 0.00  0.4     0
   2  0.2  0.0  0.0 0.22 0.00  0.0     0
   4  0.5  0.0  0.0 0.35 0.00  0.0     0
   5  0.8  0.0  0.0 0.80 0.00  0.0     0
   6  0.0  0.9  0.0 0.00 0.70  0.0     0
   7  0.0  0.3  0.0 0.00 0.30  0.0     1
   8  0.0  0.0  0.2 0.00 0.00  0.4     1
   9  0.0  0.0  0.8 0.00 0.00  0.9     0
  A1  0.0  0.0  0.1 0.00 0.00  0.2     0
  B3  0.0  0.3  0.0 0.00 0.23  0.0     0

I am using the ROSE function to generate samples for imbalanced data, but I want to preserve the ID of each observation from df after ROSE. This is the output I get after running ROSE:

library(ROSE)
df.rose <- ROSE(class ~ ., data = df, seed = 123, N = 20, p = 0.25)$data

> df.rose
 id        s1c1         s1c2          s1c3        s2c1         s2c2        s2c3   class
 B3 -0.24636399  0.513435064 -0.0844105623  0.04695640  0.419960189  0.08112992     0
  9 -0.05029030  0.199689698  0.7022285344  0.08255245 -0.133951228  1.16820765     0
  9 -0.23671562  0.167377715  0.9634146745 -0.10923003 -0.129948534  1.00641398     0
 B3 -0.16816685  0.434632663 -0.0174671002 -0.07245581  0.423706144 -0.07969934     0
  9 -0.14420654 -0.015047974  0.8530741203 -0.22148879 -0.053786877  1.18091542     0
  9 -0.38914709 -0.074365870  0.7940190162 -0.23306056 -0.230564666  1.14293933     0
  6  0.19329086  0.807524478 -0.0089820194  0.06600218  0.734243934  0.13409831     0
  6  0.03538563  0.731147735  0.2867432037  0.09746303  0.673766711  0.05837655     0
  4  0.23741363 -0.050535412 -0.0473024899  0.36152575  0.001088718 -0.15354050     0
  2  0.48927513 -0.307561385  0.3177238885  0.42054668  0.072770343  0.33271737     0
 B3  0.09839211  0.827176406 -0.3244875053  0.44579006  0.159991098 -0.14678016     0
 B3 -0.06807770  0.593601657  0.1224855617 -0.10677452  0.351707470  0.53486376     0
  9  0.20651979 -0.272977578  0.8259493668 -0.50212781 -0.041644690  1.27476593     0
  8  0.00000000 -0.008315345  0.0008152742  0.00000000  0.043469230  0.29596908     1
  7  0.00000000  0.155050387 -0.0068404803  0.00000000  0.314397160 -0.50556877     1
  7  0.00000000 -0.008021610  0.0639465277  0.00000000  0.122372337  0.27856790     1
  8  0.00000000 -0.070217063  0.2370763279  0.00000000 -0.013168583  0.04034823     1
  7  0.00000000  0.469712631  0.0130102656  0.00000000  0.566767608  0.18219645     1
  7  0.00000000  0.193749720 -0.0788801623  0.00000000  0.383380004  0.47007644     1
  7  0.00000000  0.412273782 -0.1046108759  0.00000000  0.307614552 -0.35552820     1

I am not getting all of the IDs back after ROSE, and I want to keep all of them. Does anyone know another method for handling imbalanced data that preserves the ID of each observation? I don't want to mess up the IDs. I have tried oversampling, undersampling, and SMOTE, but none of them gave good results. I also tried converting the id column to a factor, but that didn't work.
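
To see exactly which IDs are absent from the ROSE sample, the id column of the output can be compared with the original IDs. The following is a minimal base R sketch; the vector `rose_ids` is copied by hand from the printed df.rose above, and the names `original_ids`, `rose_ids`, and `missing_ids` are introduced only for illustration.

```r
# IDs in the original data frame (from the definitions of df1 and df2).
original_ids <- c("A1", "2", "B3", "4", "5", "6", "7", "8", "9", "10")

# The id column of df.rose, copied by hand from the printed output.
rose_ids <- c("B3", "9", "9", "B3", "9", "9", "6", "6", "4", "2",
              "B3", "B3", "9", "8", "7", "7", "8", "7", "7", "7")

# Original IDs that never appear in the ROSE sample.
missing_ids <- setdiff(original_ids, rose_ids)
missing_ids

# How often each original ID appears in the ROSE sample (0 for the missing ones).
table(factor(rose_ids, levels = original_ids))
```

With the live objects, the same check would presumably be `setdiff(as.character(df$id), as.character(df.rose$id))`.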

Recommended answer

If anyone is still wondering, I ended up using this method. I wanted only the new synthetic observations, but SMOTE kept reducing the size of my dataset. Hope it helps:

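library(DMwR)   # SMOTE()
library(dplyr)  # distinct()

# df  - data frame you want to use over/undersampling on
# var - the binary class column (a factor with a "yes" level in this example)

# 1. Add a row ID so that only exactly identical rows count as duplicates
df$ID <- seq.int(nrow(df))
# 2. Run SMOTE with the desired parameters
df_smote <- DMwR::SMOTE(var ~ ., df, perc.over = 100, k = 5)
# 3. Keep only the rows of the "yes" level from the SMOTE result
sub_df <- subset(df_smote, var == "yes")
# 4. Append them to the original dataset
final_df <- rbind(df, sub_df)
# 5. Remove the duplicates introduced by SMOTE
final_df <- distinct(final_df)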




  1. Create an ID column so that duplicate removal matches only rows that are exactly the same, not distinct observations that happen to share the same set of features.
  2. Use SMOTE with the desired parameters (where var is the binary variable on which you have the imbalance).
  3. Subset the SMOTE output to the observations with a certain level of var, in this case the "yes" level.
  4. Row-bind the subset to the original dataset.
  5. Remove the duplicates introduced by SMOTE.
  6. You end up with the original dataset plus only the synthetic observations, with the desired level over/undersampled.
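
To illustrate why step 1 matters for step 5, here is a small base R sketch that does not call SMOTE at all. The data frame `smote_out` is a hand-made stand-in for the "yes" rows of a SMOTE result (an assumption for illustration, including the NA used as the ID of the made-up synthetic row), and `duplicated()` is used as a base R analogue of `distinct()`.

```r
# Toy data: rows 1 and 2 are different observations with identical features.
df <- data.frame(
  x   = c(0.1, 0.1, 0.5, 0.9),
  var = factor(c("no", "no", "no", "yes"), levels = c("no", "yes"))
)

# Step 1: row IDs make the two identical "no" rows distinguishable.
df$ID <- seq_len(nrow(df))

# Stand-in for steps 2-3: the original minority row (ID 4) plus one
# made-up synthetic row. This is NOT real SMOTE output.
smote_out <- data.frame(
  x   = c(0.9, 0.8),
  var = factor(c("yes", "yes"), levels = c("no", "yes")),
  ID  = c(4L, NA)
)

# Steps 4-5: append, then drop rows that are exact duplicates.
combined <- rbind(df, smote_out)
final_df <- combined[!duplicated(combined), ]

nrow(final_df)                           # four originals plus one synthetic row
nrow(unique(combined[, c("x", "var")]))  # without ID, rows 1 and 2 would collapse
```

With the ID column, only the repeated copy of the original minority row is dropped. Deduplicating on the features alone would also merge the two genuine "no" observations.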

