Data imputation with fancyimpute and pandas
Question
I have a large pandas data frame, df. It has quite a few missing values. Dropping rows or columns is not an option. Imputing medians, means, or the most frequent values is not an option either (hence imputation with pandas and/or scikit unfortunately doesn't do the trick).
I came across what seems to be a neat package called fancyimpute (you can find it here). But I have some problems with it.
Here is what I do:
# the necessary imports
import pandas as pd
import numpy as np
from fancyimpute import KNN
# df is my data frame with the missings. I keep only floats
df_numeric = df.select_dtypes(include=[np.float])
# I now run fancyimpute KNN,
# it returns a np.array which I store as a pandas dataframe
df_filled = pd.DataFrame(KNN(3).complete(df_numeric))
However, df_filled is somehow a single vector instead of the filled data frame. How do I get hold of the data frame with the imputations?
I realized that fancyimpute needs a numpy array, so I converted df_numeric to an array using as_matrix().
# df is my data frame with the missings. I keep only floats
df_numeric = df.select_dtypes(include=[np.float]).as_matrix()
# I now run fancyimpute KNN,
# it returns a np.array which I store as a pandas dataframe
df_filled = pd.DataFrame(KNN(3).complete(df_numeric))
The output is a data frame whose column labels have gone missing. Is there any way to retrieve the labels?
Recommended answer
df=pd.DataFrame(data=mice.complete(d), columns=d.columns, index=d.index)
The np.array returned by the .complete() method of the fancyimpute object (be it mice or KNN) is passed as the content (argument data=) of a new pandas data frame, whose columns and index are copied from the original data frame d (arguments columns= and index=).
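The label-restoring step can be tried without fancyimpute installed. The following is a minimal sketch, not the answerer's code: fill_with_column_means is a hypothetical stand-in that only mimics the relevant behaviour of an imputer (array in, bare array out), and the frame d with its column names is made-up illustration data. It fills with column means purely to keep the example short, which is not what KNN or MICE do.

```python
import numpy as np
import pandas as pd


def fill_with_column_means(values):
    """Hypothetical stand-in for an imputer (NOT fancyimpute).

    Takes a 2-D float array and returns a new array in which every NaN
    is replaced by the mean of its column.
    """
    filled = np.array(values, dtype=float, copy=True)
    col_means = np.nanmean(filled, axis=0)
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled


# Made-up data: a labelled frame with one missing value per column.
d = pd.DataFrame(
    {"height": [1.0, np.nan, 3.0], "weight": [10.0, 20.0, np.nan]},
    index=["a", "b", "c"],
)

# The imputer only sees raw values and hands back a bare array,
# so column names and the index are lost at this point.
raw = fill_with_column_means(d.to_numpy())

# Reattach the labels from the original frame, as in the answer.
df_filled = pd.DataFrame(data=raw, columns=d.columns, index=d.index)
```

With a real imputer, the call to the stand-in would be replaced by the imputer's own method that returns the completed array; the pd.DataFrame(...) line stays the same.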