How to impute each categorical column in a NumPy array

Question

There are good solutions for imputing a pandas DataFrame. But since I work mainly with NumPy arrays, I have to create a new pandas DataFrame object, impute, and then convert back to a NumPy array as follows:

import pandas as pd

nomDF = pd.DataFrame(x_nominal)  # convert np.array to pd.DataFrame
nomDF = nomDF.apply(lambda x: x.fillna(x.value_counts().index[0]))  # replace NaN with the most frequent value in each column
x_nominal = nomDF.values  # convert pd.DataFrame back to np.array

Is there a way to impute directly in a NumPy array?

Recommended answer

We could use SciPy's mode to get the most frequent value in each column. The remaining work is to get the NaN indices and replace those entries in the input array with the mode values by indexing.

So, the implementation would look something like this:

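import numpy as np
from scipy.stats import mode

R, C = np.where(np.isnan(x_nominal))  # row and column indices of the NaNs
# nan_policy='omit' keeps NaNs out of the count (the default, 'propagate', can return NaN)
vals = mode(x_nominal, axis=0, nan_policy='omit')[0].ravel()  # per-column mode
x_nominal[R, C] = vals[C]  # fill each NaN with the mode of its column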
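To see the indexing step in action, here is a small self-contained sketch on toy data. The function name impute_with_mode and the sample array are made up for illustration, and I am assuming a float array in which every column has at least one non-NaN value. It passes nan_policy='omit' so that NaNs are excluded from the count.

```python
import numpy as np
from scipy.stats import mode


def impute_with_mode(x):
    """Return a copy of a 2-D float array with NaNs replaced by the column mode.

    Hypothetical helper for illustration; assumes every column contains
    at least one non-NaN value.
    """
    x = x.copy()
    rows, cols = np.where(np.isnan(x))  # positions of the missing entries
    # Per-column mode with NaNs omitted; ravel() copes with SciPy versions
    # that return the result with or without a leading length-1 axis.
    vals = np.asarray(mode(x, axis=0, nan_policy="omit")[0]).ravel()
    x[rows, cols] = vals[cols]  # look up the mode of each NaN's column
    return x


x_demo = np.array([[1.0, 2.0, np.nan],
                   [1.0, np.nan, 3.0],
                   [np.nan, 2.0, 3.0],
                   [0.0, 5.0, 4.0]])
filled = impute_with_mode(x_demo)
```

The key point is vals[cols]: cols holds the column index of every NaN, so indexing the per-column modes with it yields exactly one replacement value per missing entry.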

Please note the behavior in tie situations, i.e. when several categories/elements share the same highest count. With pandas and value_counts, we would be choosing the highest value; with SciPy's mode, it would be the lowest one.
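If the tie-breaking rule matters for your data, one option is to make it explicit rather than rely on either library's behavior. The following is only a sketch with plain NumPy; mode_with_tie_rule and its prefer parameter are hypothetical names, and it assumes a 1-D float column with at least one non-NaN value.

```python
import numpy as np


def mode_with_tie_rule(col, prefer="lowest"):
    """Most frequent non-NaN value of a 1-D float array, with explicit tie-breaking.

    Hypothetical helper for illustration; assumes at least one non-NaN value.
    """
    col = col[~np.isnan(col)]  # drop missing entries before counting
    values, counts = np.unique(col, return_counts=True)  # values come back sorted ascending
    tied = values[counts == counts.max()]  # every value that reaches the top count
    return tied[0] if prefer == "lowest" else tied[-1]


col_demo = np.array([2.0, 1.0, np.nan, 1.0, 2.0, 3.0])
lowest = mode_with_tie_rule(col_demo)                     # 1.0 and 2.0 tie; picks 1.0
highest = mode_with_tie_rule(col_demo, prefer="highest")  # picks 2.0
```

Because np.unique returns its values in sorted order, the first and last tied entries are the lowest and highest candidates respectively.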

If you are dealing with a mixed dtype of strings and NaNs, I would suggest a few modifications, keeping the last step unchanged, to make it work:

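x_nominal_U3 = x_nominal.astype('U3')  # strings of at most 3 characters; NaN becomes 'nan'
R, C = np.where(x_nominal_U3 == 'nan')  # row and column indices of the missing entries
vals = mode(x_nominal_U3, axis=0)[0].ravel()  # per-column mode of the string array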

This emits a warning during the mode calculation: RuntimeWarning: The input array could not be properly checked for nan values. nan values will be ignored. But since we actually want to ignore NaNs for that mode calculation, we should be okay there.
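As far as I am aware, recent SciPy releases restrict mode to numeric input, so the string variant may not run everywhere. A NumPy-only alternative can be sketched with np.unique. The helper name impute_string_columns and the sample data are hypothetical, and the sketch assumes an object array whose missing entries turn into the string 'nan' when cast to str.

```python
import numpy as np


def impute_string_columns(x, missing="nan"):
    """Return a str-typed copy of x with missing entries replaced by the column mode.

    Hypothetical helper for illustration; a column consisting only of
    missing entries is left unchanged.
    """
    x = x.astype(str)  # astype returns a copy; float NaN becomes the string 'nan'
    for j in range(x.shape[1]):
        col = x[:, j]  # a view, so assignments below write into x
        is_missing = col == missing
        values, counts = np.unique(col[~is_missing], return_counts=True)
        if values.size:  # skip columns with nothing to count
            col[is_missing] = values[counts.argmax()]  # ties resolve to the lowest value
    return x


x_str = np.array([["a", "x"],
                  [np.nan, "y"],
                  ["a", np.nan],
                  ["b", "y"]], dtype=object)
filled_str = impute_string_columns(x_str)
```

Unlike the 'U3' cast, astype(str) does not truncate longer category names, and the missing entries never take part in the count.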
