Normalizing Pandas Series with condition


Problem description

I'm learning Python/Pandas with a DataFrame that has the following structure:

import pandas as pd

df = pd.DataFrame({'key' : [111, 222, 333, 444, 555, 666, 777, 888, 999],
                   'score1' : [-1, 0, 2, -1, 7, 0, 15, 0, 1], 
                   'score2' : [2, 2, -1, 10, 0, 5, -1, 1, 0]})

print(df)

   key  score1  score2
0  111      -1       2
1  222       0       2
2  333       2      -1
3  444      -1      10
4  555       7       0
5  666       0       5
6  777      15      -1
7  888       0       1
8  999       1       0

The possible values for the score1 and score2 Series are -1 and all non-negative integers (0 and up).

My goal is to normalize both columns in the following way:

  • If the value is equal to -1, return a missing value (NaN).
  • Otherwise, normalize the remaining non-negative integers to a scale between 0 and 1.
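
For reference, the second rule can be read as min-max scaling computed over the non-missing values only. This reading is an assumption; the question does not fix a formula, although it matches what MinMaxScaler does with its default range:

```latex
x'_i = \frac{x_i - \min_j x_j}{\max_j x_j - \min_j x_j},
\qquad \text{with } \min \text{ and } \max \text{ taken over all } j \text{ such that } x_j \neq -1
```

For score1 the valid values range from 0 to 15, so a score of 7 would map to 7/15 ≈ 0.467.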

I don't want to overwrite the original Series score1 and score2. Instead, I would like to apply a function to both Series to create two new columns (say norm1 and norm2).

I read several posts here that recommend using MinMaxScaler() from the sklearn preprocessing module. I don't think this is what I need, since I need an extra condition to take care of the -1 values.

What I think I need is a specific function that I can apply to both Series. I have also familiarized myself with how normalization works, but I'm having difficulty implementing this function in Python. Any additional help would be greatly appreciated.

Recommended answer

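The idea is to convert the -1 values to missing values (NaN) first. MinMaxScaler (scikit-learn 0.20 or later) disregards NaN when fitting and keeps it in the transformed output, so the missing entries stay missing: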

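from sklearn import preprocessing

cols = ['score1', 'score2']
# Replace -1 with NaN (note: this overwrites score1 and score2 in df)
df[cols] = df[cols].mask(df[cols] == -1)

# Scale each column to [0, 1]; NaN entries are ignored and preserved
x = df[cols].values
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
# Attach the scaled values as norm_score1 and norm_score2
df = df.join(pd.DataFrame(x_scaled, columns=cols).add_prefix('norm_'))
print(df)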
   key  score1  score2  norm_score1  norm_score2
0  111     NaN     2.0          NaN          0.2
1  222     0.0     2.0     0.000000          0.2
2  333     2.0     NaN     0.133333          NaN
3  444     NaN    10.0          NaN          1.0
4  555     7.0     0.0     0.466667          0.0
5  666     0.0     5.0     0.000000          0.5
6  777    15.0     NaN     1.000000          NaN
7  888     0.0     1.0     0.000000          0.1
8  999     1.0     0.0     0.066667          0.0
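
The answer above replaces -1 with NaN in score1 and score2 themselves. If the original columns must stay untouched, as the question asks, the same scaling can be sketched in plain pandas without sklearn. The helper name normalize_with_missing, its missing parameter, and the handling of constant columns are assumptions made for this illustration and are not part of the original answer.

```python
import pandas as pd


def normalize_with_missing(s: pd.Series, missing=-1) -> pd.Series:
    """Min-max scale a Series to [0, 1], turning `missing` into NaN.

    Hypothetical helper for illustration, not from the original answer.
    """
    # Hide the sentinel so it does not take part in min/max
    valid = s.mask(s == missing)
    lo, hi = valid.min(), valid.max()  # both skip NaN by default
    span = hi - lo
    if pd.isna(span) or span == 0:
        # All-missing or constant column: min-max is undefined here.
        # Returning 0.0 for valid entries is an assumption of this sketch.
        return valid * 0.0
    return (valid - lo) / span


df = pd.DataFrame({'key': [111, 222, 333, 444, 555, 666, 777, 888, 999],
                   'score1': [-1, 0, 2, -1, 7, 0, 15, 0, 1],
                   'score2': [2, 2, -1, 10, 0, 5, -1, 1, 0]})

# assign returns a new frame; score1 and score2 are left unchanged
out = df.assign(norm1=normalize_with_missing(df['score1']),
                norm2=normalize_with_missing(df['score2']))
print(out)
```

Because the helper works on one Series at a time, it can also be applied to several columns at once with df[['score1', 'score2']].apply(normalize_with_missing), at the cost of having to rename the resulting columns afterwards.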
