Python: feature selection in scikit-learn for a normal distribution

Question

I have a pandas DataFrame whose index holds unique user identifiers, whose columns correspond to unique events, and whose values are 1 (attended), 0 (did not attend), or NaN (wasn't invited/not relevant). The matrix is pretty sparse with respect to NaNs: there are several hundred events and most users were only invited to several tens at most.

I created some extra columns to measure "success", which I define as the fraction of invitations that were attended:

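event_cols = list(my_data.columns)  # event columns only, captured before adding derived columns
my_data['invited'] = my_data[event_cols].count(axis=1)  # non-NaN entries = invitations
my_data['attended'] = my_data[event_cols].sum(axis=1)   # NaN is skipped, so this counts the 1s
my_data['success'] = my_data['attended'] / my_data['invited']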
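
As a sanity check, the same calculation can be run on a tiny made-up frame. The event names, user labels, and values below are hypothetical and only illustrate how `count` and `sum` treat NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 users x 4 events; NaN means "not invited".
toy = pd.DataFrame(
    {
        "event_a": [1, 0, np.nan],
        "event_b": [1, np.nan, np.nan],
        "event_c": [0, 1, 1],
        "event_d": [np.nan, 1, 0],
    },
    index=["user_1", "user_2", "user_3"],
)

toy_events = list(toy.columns)             # remember which columns are events
invited = toy[toy_events].count(axis=1)    # count() ignores NaN
attended = toy[toy_events].sum(axis=1)     # sum() skips NaN by default
success = attended / invited

print(pd.DataFrame({"invited": invited, "attended": attended, "success": success}))
```

Here user_3 was invited to two events and attended one, so their success is 0.5.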

Assume the following is true: the success data should be normally distributed with mean 0.80 and s.d. 0.10. When I look at the histogram of my_data['success'], it is not normal and is skewed left. It is not important whether this assumption holds in reality. I just want to solve the technical problem I pose below.

So this is my problem: there are some events which I don't think are "good" in the sense that they are making the success data diverge from normal. I'd like to do "feature selection" on my events to pick a subset of them which makes the distribution of my_data['success'] as close to normal as possible in the sense of "convergence in distribution".

I looked at the scikit-learn "feature selection" methods here and the "Univariate feature selection" seems like it makes sense. But I'm very new to both pandas and scikit-learn and could really use help on how to actually implement this in code.

Constraints: I need to keep at least half the original events.

Any help would be greatly appreciated. Please share as many details as you can, I am very new to these libraries and would love to see how to do this with my DataFrame.

Thanks!

EDIT: After looking some more at the scikit-learn feature selection approaches, "Recursive feature selection" seems like it might make sense here too but I'm not sure how to build it up with my "accuracy" metric being "close to normally distributed with mean..."

Recommended answer

Keep in mind that feature selection is to select features, not samples, i.e., (typically) the columns of your DataFrame, not the rows. So, I am not sure if feature selection is what you want: I understand that you want to remove those samples that cause the skew in your distribution?
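
If the goal really is to drop event columns (features), one possible sketch is a greedy backward elimination that scores each candidate subset by how far the resulting success distribution is from N(0.80, 0.10). This is not a scikit-learn API: the Kolmogorov-Smirnov distance as the closeness measure, the greedy strategy, the helper names (`success_rate`, `ks_distance`, `greedy_select`), and the synthetic data are all assumptions made for illustration.

```python
import math
import numpy as np
import pandas as pd

def success_rate(events: pd.DataFrame) -> pd.Series:
    """Attended / invited per user; users with no invitations are dropped."""
    invited = events.count(axis=1)
    attended = events.sum(axis=1)
    return (attended / invited.where(invited > 0)).dropna()

def ks_distance(sample: pd.Series, mean: float = 0.80, sd: float = 0.10) -> float:
    """Kolmogorov-Smirnov distance between the sample's ECDF and N(mean, sd)."""
    x = np.sort(sample.to_numpy(dtype=float))
    n = len(x)
    if n == 0:
        return 1.0
    cdf = np.array([0.5 * (1.0 + math.erf((v - mean) / (sd * math.sqrt(2.0)))) for v in x])
    upper = np.arange(1, n + 1) / n - cdf   # ECDF just after each point minus CDF
    lower = cdf - np.arange(0, n) / n       # CDF minus ECDF just before each point
    return float(max(upper.max(), lower.max()))

def greedy_select(events: pd.DataFrame, min_keep: int) -> list:
    """Drop one event at a time for as long as that lowers the KS distance."""
    keep = list(events.columns)
    best = ks_distance(success_rate(events[keep]))
    while len(keep) > min_keep:
        # Score every single-column removal and pick the most helpful one.
        scores = {
            c: ks_distance(success_rate(events[[k for k in keep if k != c]]))
            for c in keep
        }
        drop = min(scores, key=scores.get)
        if scores[drop] >= best:
            break  # no single removal improves the fit
        keep.remove(drop)
        best = scores[drop]
    return keep

# Hypothetical data: most events are attended ~80% of the time, two only ~20%.
rng = np.random.default_rng(0)
n_users, n_events = 300, 12
p = np.full(n_events, 0.8)
p[:2] = 0.2
went = (rng.random((n_users, n_events)) < p).astype(float)
invited_mask = rng.random((n_users, n_events)) < 0.7
events = pd.DataFrame(
    np.where(invited_mask, went, np.nan),
    columns=[f"event_{i}" for i in range(n_events)],
)

min_keep = math.ceil(n_events / 2)  # constraint: keep at least half the events
before = ks_distance(success_rate(events))
kept = greedy_select(events, min_keep=min_keep)
after = ks_distance(success_rate(events[kept]))
print(f"KS distance before: {before:.3f}, after: {after:.3f}, kept {len(kept)} events")
```

Each pass re-scores every remaining column, so the cost grows roughly quadratically with the number of events. A greedy search can also stop at a local optimum, so the result is a heuristic rather than the best possible subset.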

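Also, what about feature scaling, e.g., standardization, so that your data has mean=0 and sd=1? Note that standardization only shifts and rescales the values; it does not change the shape of the distribution, so skewed data stays skewed.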

The equation is simply z = (x - mean) / sd

To apply it to your DataFrame, you can simply do

my_data['success'] = (my_data['success'] - my_data['success'].mean(axis=0)) / (my_data['success'].std(axis=0))
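
A quick check of what this transform does can be run on made-up data. The beta-distributed sample below is a hypothetical stand-in for a left-skewed success column; the point is that the mean and sd move to 0 and 1 while the skewness is unaffected, because the transform is linear:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical left-skewed "success" values in [0, 1] with mean near 0.8.
success = pd.Series(rng.beta(8, 2, size=1000), name="success")

# Same formula as above: subtract the mean, divide by the standard deviation.
z = (success - success.mean()) / success.std()

print(f"mean: {z.mean():.3f}, sd: {z.std():.3f}")
print(f"skew before: {success.skew():.3f}, skew after: {z.skew():.3f}")
```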

However, don't forget to keep the mean and SD parameters so that you can transform your test data with them, too. Alternatively, you could use the StandardScaler from scikit-learn.
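
A minimal sketch of keeping the parameters, assuming a hypothetical split into `train` and `test` series (both made up here): the mean and sd are estimated once on the training data and then reused unchanged on the test data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical training and test "success" values.
train = pd.Series(rng.beta(8, 2, size=800), name="success")
test = pd.Series(rng.beta(8, 2, size=200), name="success")

# Estimate the parameters on the training data only, and keep them.
mu, sd = train.mean(), train.std()

train_z = (train - mu) / sd
test_z = (test - mu) / sd  # reuse the training parameters; do not re-estimate on test

print(f"train: mean={train_z.mean():.3f}, sd={train_z.std():.3f}")
print(f"test:  mean={test_z.mean():.3f}, sd={test_z.std():.3f}")
```

StandardScaler follows the same pattern through `fit` on the training data and `transform` on both sets. Its results differ slightly from the pandas version because it divides by the population standard deviation (ddof=0), while `Series.std()` defaults to ddof=1.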
