SettingWithCopyWarning when Standardizing Only Numeric Columns in a Pandas DataFrame with Sklearn


Question

I am getting a SettingWithCopyWarning from Pandas when performing the operation below. I understand what the warning means, and I know I can turn it off, but I am curious whether I am performing this type of standardization incorrectly with a pandas DataFrame (I have mixed data with categorical and numeric columns). My numbers look fine after checking, but I would like to clean up my syntax to make sure I am using Pandas correctly.

I am also curious whether there is a better workflow for this type of operation when dealing with data sets that have mixed data types like this.

My process is as follows, with some toy data:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from typing import List

# toy data with categorical and numeric data
df: pd.DataFrame = pd.DataFrame([['0',100,'A', 10],
                                ['1',125,'A',15],
                                ['2',134,'A',20],
                                ['3',112,'A',25],
                                ['4',107,'B',35],
                                ['5',68,'B',50],
                                ['6',321,'B',10],
                                ['7',26,'B',27],
                                ['8',115,'C',64],
                                ['9',100,'C',72],
                                ['10',74,'C',18],
                                ['11',63,'C',18]], columns = ['id', 'weight','type','age'])
df.dtypes
id        object
weight     int64
type      object
age        int64
dtype: object

# select categorical data for later operations
cat_cols: List = df.select_dtypes(include=['object']).columns.values.tolist()
# select numeric columns for later operations
numeric_cols: List = df.columns[df.dtypes.apply(lambda x: np.issubdtype(x, np.number))].values.tolist()

# prepare data for modeling by splitting into train and test
# compute the standardization means/standard deviations from the TRAINING SET only
# and apply them to the testing set, so that no information from the testing set leaks into training
X: pd.DataFrame = df.copy()
y: pd.Series = df.pop('type')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# perform standardization of numeric variables using the mean and standard deviations of the training set only
X_train_numeric_tmp: pd.DataFrame = X_train[numeric_cols].values
X_train_scaler = preprocessing.StandardScaler().fit(X_train_numeric_tmp)
X_train[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_train[numeric_cols])
X_test[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_test[numeric_cols])


<ipython-input-15-74f3f6c70f6a>:10: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

Recommended answer

Your X_train and X_test are still slices of the original DataFrame. Assigning into a slice triggers the warning, and the assignment often does not behave as intended.
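To make the slice issue concrete, here is a minimal sketch with hypothetical toy data (the frame and the `subset` name are not from the question). In pandas versions that emit this warning, a selection taken from a DataFrame may or may not share data with its parent, so pandas cannot tell whether an assignment was meant for the parent. An explicit `.copy()` removes that ambiguity: the result is an independent frame, and assigning to it leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration.
df = pd.DataFrame({"weight": [100, 125, 134], "age": [10, 15, 20]})

# Take a selection and copy it explicitly so it owns its data.
subset = df[df["age"] > 10].copy()

# Assigning a column on the explicit copy is unambiguous.
subset["weight"] = subset["weight"] / 100

# The original frame keeps its values.
print(df["weight"].tolist())
print(subset)
```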

You can either transform before train_test_split, or call X_train = X_train.copy() (and likewise for X_test) after the split and then transform.

The second approach prevents the information leak mentioned in your code comments. So something like this:

# these two lines look off to me
# X: pd.DataFrame = df.copy()    # X still contains the label column 'type'; shouldn't it be dropped?
# y: pd.Series = df.pop('type')  # simpler: y = df['type']

# pass them directly instead
features = [c for c in df if c!='type']
X_train, X_test, y_train, y_test = train_test_split(df[features], df['type'], 
                                                    test_size = 0.2, 
                                                    random_state = 0)

# now copy what we want to transform
X_train = X_train.copy()
X_test = X_test.copy()

## Code below should work without warning
############
# perform standardization of numeric variables using the mean and standard deviations of the training set only
# you don't need to copy the data into a separate array to fit
# X_train_numeric_tmp: pd.DataFrame = X_train[numeric_cols].values
X_train_scaler = preprocessing.StandardScaler().fit(X_train[numeric_cols])

X_train[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_train[numeric_cols])
X_test[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_test[numeric_cols])
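
On the question of a better workflow for mixed data types, one possible alternative (not part of the answer above, so treat it as a sketch) is scikit-learn's `ColumnTransformer`, which applies a transformer to named columns and returns a new array instead of assigning back into the DataFrame. The toy data below reproduces the question's frame; the variable names `scaler`, `X_train_scaled`, and `X_test_scaled` are chosen here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same toy data as in the question.
df = pd.DataFrame(
    {
        "id": [str(i) for i in range(12)],
        "weight": [100, 125, 134, 112, 107, 68, 321, 26, 115, 100, 74, 63],
        "type": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
        "age": [10, 15, 20, 25, 35, 50, 10, 27, 64, 72, 18, 18],
    }
)

numeric_cols = ["weight", "age"]
features = [c for c in df.columns if c != "type"]

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["type"], test_size=0.2, random_state=0
)

# Scale only the numeric columns. Columns not listed (here "id") are
# dropped by default; remainder="passthrough" would keep them instead.
scaler = ColumnTransformer([("num", StandardScaler(), numeric_cols)])

# Fit on the training set only, then reuse those statistics on the test set.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Sanity check: the training columns should now have (near) zero mean.
print(np.round(X_train_scaled.mean(axis=0), 6))
```

Because nothing is written back into `X_train` or `X_test`, there is no slice assignment for pandas to warn about. The trade-off is that the result is a NumPy array, so column labels have to be reattached if a DataFrame is needed downstream.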
