How to perform StandardScaler on a pandas DataFrame with a column/columns containing numpy.ndarrays?

Problem description

I have a pandas DataFrame in which some columns contain numpy.ndarrays:

  col1         col2           col3         col4
0  4    array([34, 56, 234])   7     array([765, 654])
1  3    array([11, 598, 1])    89    array([34, 90])

I would like to perform some type of scaling on it.

I have done the pretty standard thing:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)


from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

and I run into the expected error:

ValueError: setting an array element with a sequence.

I need help standardizing these numpy arrays along with everything else!

Solution

StandardScaler expects every column to hold numeric values, but col2 and col4 hold sequences, hence the error.
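As a minimal sketch of how the offending columns could be found programmatically, the snippet below rebuilds the two-row frame from the question and checks the cell types. The names `df`, `seq_cols` and `num_cols` are illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the layout shown in the question
df = pd.DataFrame({
    "col1": [4, 3],
    "col2": [np.array([34, 56, 234]), np.array([11, 598, 1])],
    "col3": [7, 89],
    "col4": [np.array([765, 654]), np.array([34, 90])],
})

# A column counts as a "sequence column" if every cell is an array, list or tuple
seq_cols = [
    c for c in df.columns
    if df[c].map(lambda v: isinstance(v, (np.ndarray, list, tuple))).all()
]
num_cols = [c for c in df.columns if c not in seq_cols]

print(seq_cols)  # ['col2', 'col4']
print(num_cols)  # ['col1', 'col3']
```

The numeric columns can go straight into StandardScaler, while the sequence columns need the separate handling described next.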

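I think it is best to treat the columns that hold sequences separately and then combine them with the rest of the data.

For now, I will assume that, within a given column, every row's sequence has the same number of elements, e.g. every row of col2 holds a 3-value array.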
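The equal-length assumption can be checked before going further. The helper below is a hypothetical sketch (the name `has_uniform_length` is not from the original answer); it compares the length of every cell in a column.

```python
import numpy as np
import pandas as pd

def has_uniform_length(series: pd.Series) -> bool:
    """Return True if every sequence in `series` has the same number of elements."""
    return series.map(len).nunique() == 1

# Same shape as col2 in the question: every row has 3 values
uniform = pd.Series([np.array([34, 56, 234]), np.array([11, 598, 1])])
# A ragged column, made up here to show the failing case
ragged = pd.Series([np.array([1, 2, 3]), np.array([4, 5])])

print(has_uniform_length(uniform))  # True
print(has_uniform_length(ragged))   # False
```

If the check fails, np.vstack cannot build a rectangular 2D array from the column, so the sequences would first need padding or truncating, which is outside the scope of this answer.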

StandardScaler calculates the mean and std of each column individually, so there are two approaches for sequence columns:

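Approach 1: Elements at all positions of the sequence come from the same distribution.

In this case, the mean and std should be computed over all values. Fit StandardScaler on the flattened array, transform it, and then reshape the result back to the original shape.

Approach 2: Elements at different positions of the sequence come from different distributions.

In this case, the single column can be converted to a 2D numpy array. Fit StandardScaler on that 2D array (the mean and std are then calculated separately for each position) and put the result back into a single column after the transformation.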

Below is the code for both approaches:

import numpy as np
from sklearn.preprocessing import StandardScaler

# numeric columns work as expected
X_train_1 = X_train[['col1', 'col3']]
X_test_1 = X_test[['col1', 'col3']]

sc = StandardScaler()
X_train_1 = sc.fit_transform(X_train_1)
X_test_1 = sc.transform(X_test_1)

# first convert the sequence column to a 2D array of shape (n_rows, seq_len)
X_train_col2 = np.vstack(X_train['col2'].values).astype(float)
X_test_col2 = np.vstack(X_test['col2'].values).astype(float)

# for sequence columns, use ONE of the two approaches:
# Approach 1: a single mean/std shared by all values
sc_col2 = StandardScaler()
X_train_2 = sc_col2.fit_transform(X_train_col2.reshape(-1, 1))
X_train_2 = X_train_2.reshape(X_train_col2.shape)

X_test_2 = sc_col2.transform(X_test_col2.reshape(-1, 1))
X_test_2 = X_test_2.reshape(X_test_col2.shape)


# Approach 2: a separate mean/std for each position
sc_col2 = StandardScaler()
X_train_2 = sc_col2.fit_transform(X_train_col2)

X_test_2 = sc_col2.transform(X_test_col2)

# To assign back to the dataframe (one list per row), you can do the following:
X_test["col2_scaled"] = X_test_2.tolist()

# To stack with the other numpy arrays
X_train_scaled = np.hstack((X_train_1, X_train_2))
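To make the difference between the two approaches concrete, here is a small self-contained sketch on the two col2 rows from the question. The array `col2` stands in for `np.vstack(X_train['col2'].values)`, and the scaler names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the expanded col2: 2 rows, 3 positions per row
col2 = np.array([[34, 56, 234],
                 [11, 598, 1]], dtype=float)

# Approach 1: flatten to one feature, so a single mean/std is shared by all values
sc_shared = StandardScaler().fit(col2.reshape(-1, 1))
scaled_shared = sc_shared.transform(col2.reshape(-1, 1)).reshape(col2.shape)

# Approach 2: keep the 2D shape, so each position gets its own mean/std
sc_per_pos = StandardScaler().fit(col2)
scaled_per_pos = sc_per_pos.transform(col2)

print(sc_shared.mean_.shape)   # (1,)
print(sc_per_pos.mean_.shape)  # (3,)
```

With Approach 1 the scaled values have zero mean over the whole array. With Approach 2 each position (each column of the 2D array) has zero mean on its own.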


In Approach 2, it is possible to stack all columns first and then apply StandardScaler to all of them in one shot.
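A minimal sketch of that one-shot variant follows, assuming the column layout from the question. The toy `X_train` and the names `num_cols`, `seq_cols` and `sc_all` are illustrative, not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training frame with the same layout as in the question
X_train = pd.DataFrame({
    "col1": [4, 3],
    "col2": [np.array([34, 56, 234]), np.array([11, 598, 1])],
    "col3": [7, 89],
    "col4": [np.array([765, 654]), np.array([34, 90])],
})
num_cols = ["col1", "col3"]
seq_cols = ["col2", "col4"]

# Numeric columns as one 2D block, each sequence column expanded to its own 2D block
numeric_block = X_train[num_cols].to_numpy(dtype=float)
seq_blocks = [np.vstack(X_train[c].values).astype(float) for c in seq_cols]

# Stack everything side by side: shape (n_rows, 2 + 3 + 2)
stacked = np.hstack([numeric_block] + seq_blocks)

# A single scaler then computes a separate mean/std for every stacked column
sc_all = StandardScaler()
X_train_scaled = sc_all.fit_transform(stacked)

print(X_train_scaled.shape)  # (2, 7)
```

The test set would need to be stacked in the same column order and passed through `sc_all.transform`. This only reproduces Approach 2, because every position is scaled independently.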
