Variable fillna() in each column

Problem description

For starters, here is some artificial data fitting my problem:

import numpy as np
import pandas as pd

# vsize is not defined in the original post; any positive integer works
vsize = 100

# ten feature columns on every third index value
df = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)),
                  columns=["col_{}".format(x) for x in range(10)],
                  index=range(0, vsize * 3, 3))

# ten more feature columns on every second index value
df_2 = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)),
                    columns=["col_{}".format(x) for x in range(10, 20, 1)],
                    index=range(0, vsize * 2, 2))

# the outer merge leaves NaNs wherever only one frame has a given index value
df = df.merge(df_2, left_index=True, right_index=True, how='outer')

# class labels at four levels of granularity (2, 4, 8 and 16 classes)
df_tar = pd.DataFrame({"tar_1": [np.random.randint(0, 2) for x in range(vsize * 3)],
                       "tar_2": [np.random.randint(0, 4) for x in range(vsize * 3)],
                       "tar_3": [np.random.randint(0, 8) for x in range(vsize * 3)],
                       "tar_4": [np.random.randint(0, 16) for x in range(vsize * 3)]})

df = df.merge(df_tar, left_index=True, right_index=True, how='inner')

Now, I would like to fill the NaN values in each column with the MEDIAN of the non-NaN values in that column, with noise added to each filled NaN. The MEDIAN should first be calculated over the values in the column that belong to the same class, as marked in column tar_4. Then, if any NaNs persist in the column (because every value of the column within some tar_4 class was NaN, so no MEDIAN could be calculated), the same operation is repeated on the updated column (with some NaNs already filled by the tar_4 pass), but with classes taken from column tar_3. Then tar_2, and then tar_1.

The way I imagine it is as follows:

  • col_1 features e.g. 6 non-NaN and 4 NaN values: [1, 2, NaN, 4, NaN, 12, 5, NaN, 1, NaN]
  • only values [1, 2, NaN, 4, NaN] belong to the same class (e.g. class 1) in tar_4, so they are pushed through NaN filling:
    • NaN value at index [2] gets filled with MEDIAN (=2) + random(-3, 3) * std error of distribution in col_1, e.g. 2 + (1 * 1.24)
    • NaN value at index [4] gets filled with MEDIAN (=2) + random(-3, 3) * std error of distribution in col_1, e.g. 2 + (-2 * 1.24)
  • out of [1, 2, 1.24, 4, -0.48, 12, 5, NaN, 1, NaN], values [1, 2, 1.24, 4, -0.48, 12, 5, NaN] are now in the same class (relative to tar_3), so they get processed:
    • NaN value at index [7] gets assigned MEDIAN of values in indices [0-6] (=2) + random(-3, 3) * std error, e.g. 2 + 2 * 3.86
  • all values in col_1 belong to the same class based on column tar_2, so the NaN value at index [9] gets processed with the same logic as described above and ends up with e.g. the value 2 + (-1 * 4.05)
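
The medians and spreads quoted in this example can be checked with a few lines of NumPy. This is only a sanity check on the toy numbers, not part of the asker's code. The figures 1.24 and 3.86 agree with the population standard deviation (NumPy's default, ddof=0), whereas pandas' `std` defaults to the sample version (ddof=1), so a pandas-based implementation will give slightly larger spreads.

```python
import numpy as np

# Toy column from the example; NaN marks the missing entries.
col_1 = np.array([1, 2, np.nan, 4, np.nan, 12, 5, np.nan, 1, np.nan])

# Stage 1: indices 0-4 share a tar_4 class; the NaN-aware functions skip the gaps.
stage_1 = col_1[:5]
median_1 = np.nanmedian(stage_1)   # median of [1, 2, 4]
std_1 = np.nanstd(stage_1)         # population std of [1, 2, 4], about 1.247

# Stage 2: indices 0-6 after the two stage-1 fills quoted in the example.
stage_2 = np.array([1, 2, 1.24, 4, -0.48, 12, 5])
median_2 = np.median(stage_2)
std_2 = np.std(stage_2)            # about 3.858

print(median_1, round(std_1, 3), median_2, round(std_2, 3))
```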

The same logic applies to the rest of the columns.

So, the expected output is a DataFrame with the NaN values filled in each column, based on a decreasing level of class granularity, going from column tar_4 down to tar_1.

I already have code that partly achieves this, thanks to @Quang Hoang:

def min_max_check(col):
    # Relies on the globals df, medians and stds set in the loop below.
    # np.random.randint excludes `high`, so the draws are 0..2 and -3..2.
    if ((df[col].dropna() >= 0) & (df[col].dropna() <= 1.0)).all():
        return medians[col]
    elif (df[col].dropna() >= 0).all():
        return medians[col] + round(np.random.randint(low=0, high=3) * stds[col], 2)
    else:
        return medians[col] + round(np.random.randint(low=-3, high=3) * stds[col], 2)


tar_list = ['tar_4', 'tar_3', 'tar_2', 'tar_1']
cols = [col for col in df.columns if col not in tar_list]
# since your dataframe may not have continuous index
idx = df.index

for tar in tar_list:
    medians = df[cols].groupby(by=df[tar]).agg('median')
    stds = df[cols].groupby(by=df[tar]).agg('std')
    # index by class so that fillna can align the per-class fill values
    df.set_index(tar, inplace=True)
    for col in cols:
        df[col] = df[col].fillna(min_max_check(col))
    df.reset_index(inplace=True)

df.index = idx

However, at each granularity level this fills every NaN of a class with the same MEDIAN + noise value. How can this code be enhanced to generate a different fill value for each NaN at the tar_4, tar_3, tar_2 and tar_1 levels?

Recommended answer

One quick solution is to replace your min_max_check with a gen_noise function that draws noise for each row:

def gen_noise(col):
    num_row = len(df)

    # generate noise of the same height as our dataset
    # notice the size argument in randint
    if ((df[col].dropna() >= 0) & (df[col].dropna() <= 1.0)).all():
        noise = 0
    elif (df[col].dropna() >= 0).all():
        noise = np.random.randint(low=0,
                                  high=3,
                                  size=num_row)
    else:
        noise = np.random.randint(low=-3,
                                  high=3,
                                  size=num_row)

    # multiplication with isna() forces those at non-null values in df[col] to be 0
    return noise * df[col].isna()
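
To see why the `size` argument matters, here is a small self-contained sketch on a made-up Series (the data and the seeded `rng` are assumptions chosen for illustration). A single scalar draw is broadcast to every NaN, while one draw per row lets each NaN receive its own value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

s = pd.Series([1.0, np.nan, np.nan, np.nan, 4.0])
median, std = s.median(), s.std()

# One scalar draw: every NaN receives exactly the same fill value.
k = rng.integers(-3, 3)
same_fill = s.fillna(median + k * std)

# One draw per row: each NaN can receive a different fill value.
k_rows = rng.integers(-3, 3, size=len(s))
varied_fill = s.fillna(pd.Series(median + k_rows * std, index=s.index))

print(same_fill.tolist())
print(varied_fill.tolist())
```

In both cases the existing values at positions 0 and 4 are left untouched, because `fillna` only writes into the gaps.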
      

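And then later, inside the loop over tar_list:

df.set_index(tar, inplace=True)

for col in cols:
    noise = gen_noise(col)
    # std of each row's class, in the row order of df; a class with fewer
    # than two values has no std, so it is treated as zero noise
    row_std = stds[col].reindex(df.index).fillna(0)
    df[col] = (df[col].fillna(medians[col])
                      .add(noise.values * row_std.values)
              )

df.reset_index(inplace=True)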
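
For reference, the whole cascade can also be written without switching the index back and forth. The following is a minimal sketch, not the answerer's code: the function name `fill_with_noisy_median`, its signature and the demo frame are hypothetical, it uses `groupby(...).transform` to broadcast each class statistic back to its rows, and it leaves out the sign and range checks of `min_max_check` for brevity.

```python
import numpy as np
import pandas as pd


def fill_with_noisy_median(df, cols, tar_list, rng):
    """Fill NaNs level by level with class median + per-row noise (illustrative)."""
    out = df.copy()
    for tar in tar_list:                      # finest grouping first
        for col in cols:
            missing = out[col].isna()
            if not missing.any():
                continue
            grouped = out.groupby(tar)[col]
            # transform returns one value per row: the statistic of that row's class
            med = grouped.transform("median")
            std = grouped.transform("std").fillna(0.0)  # no std for classes with < 2 values
            k = rng.integers(-3, 3, size=len(out))      # one integer per row, upper bound excluded
            filled = med + np.round(k * std, 2)
            # keep existing values, write the noisy median only into the gaps
            out[col] = out[col].where(~missing, filled)
    return out


rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "col_1": [1, 2, np.nan, 4, np.nan, 12, 5, np.nan, 1, np.nan],
    "tar_4": [1, 1, 1, 1, 1, 2, 2, 3, 4, 5],
    "tar_3": [1, 1, 1, 1, 1, 1, 1, 1, 2, 3],
    "tar_2": [1] * 10,
})
result = fill_with_noisy_median(demo, ["col_1"], ["tar_4", "tar_3", "tar_2"], rng)
print(result["col_1"].tolist())
```

A class whose values are all NaN has a NaN median, so its rows stay empty at that level and are picked up by the next, coarser grouping, which is the cascade described in the question.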
      


Note: You can modify the code further by generating a noise_df with the same shape as medians and stds, something like this:

for tar in tar_list:
    medians = df[cols].groupby(df[tar]).agg('median')
    stds = df[cols].groupby(df[tar]).agg('std')

    # generate noise_df here, with the same index and columns as medians and stds
    medians = medians + (noise_df * stds).round(2)

    df.set_index(tar, inplace=True)

    for col in cols:
        df[col] = df[col].fillna(medians[col])

    df.reset_index(inplace=True)

df.index = idx
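
One possible way to build such a noise_df is sketched below for a single grouping level. The toy frame, the seeded `rng` and the symmetric integer range are assumptions made for illustration; the answer leaves this step open. Note that this variant draws one multiplier per class and column, so all NaNs of the same class within a column still share a fill value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

df = pd.DataFrame({
    "col_0": [1.0, np.nan, 3.0, 10.0, np.nan, 14.0],
    "col_1": [np.nan, 5.0, 7.0, np.nan, 2.0, 4.0],
    "tar_1": [0, 0, 0, 1, 1, 1],
})
cols = ["col_0", "col_1"]
tar = "tar_1"

medians = df[cols].groupby(df[tar]).agg("median")
stds = df[cols].groupby(df[tar]).agg("std")

# One integer multiplier per (class, column) cell, shaped like medians and stds.
noise_df = pd.DataFrame(rng.integers(-3, 3, size=medians.shape),
                        index=medians.index, columns=medians.columns)
fill_values = medians + (noise_df * stds).round(2)

# Look up the fill values of each row's class, then relabel with df's own index.
per_row = fill_values.reindex(index=df[tar].to_numpy())
per_row.index = df.index
df[cols] = df[cols].fillna(per_row)
print(df)
```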
      
