What to do with missing values when plotting with seaborn?


Question

I replaced the missing values with NaN using the following lambda function:

data = data.applymap(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)

where data is the dataframe I am working on.

Afterwards, I tried to plot one of its attributes, alcconsumption, using seaborn.distplot as follows:

seaborn.distplot(data['alcconsumption'],hist=True,bins=100)
plt.xlabel('AlcoholConsumption')
plt.ylabel('Frequency(normalized 0->1)')

This gave me the following error:

AttributeError: max must be larger than min in range parameter.

Recommended answer

I would definitely handle missing values before you plot your data. Whether or not to use dropna() depends entirely on the nature of your dataset. Is alcconsumption a single series or part of a dataframe? In the latter case, calling dropna() on the dataframe would remove the corresponding rows in the other columns as well. Are the missing values few or many? Are they spread around in your series, or do they tend to occur in groups? Is there perhaps reason to believe that there is a trend in your dataset?
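To make the series versus dataframe distinction concrete, here is a minimal sketch on a made-up two-column frame. The name toy and its values are assumptions for illustration only, not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column 'A' has two missing values, 'B' has none
toy = pd.DataFrame({'A': [1.0, np.nan, 3.0, np.nan], 'B': [10, 20, 30, 40]})

# Dropping on the single column returns a shorter series and leaves toy untouched
a_only = toy['A'].dropna()

# Dropping on the whole dataframe removes every row with a NaN in any column,
# so the valid 'B' values in those rows are lost as well
all_rows = toy.dropna()

# Restricting the check to 'A' makes the intent explicit
by_subset = toy.dropna(subset=['A'])

print(len(toy), len(a_only), len(all_rows))
```

If only one column is going to be plotted, dropping on that column alone avoids discarding valid observations elsewhere.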

If the missing values are few and scattered, you could easily use dropna(). In other cases I would choose to fill missing values with the previously observed value (1), or even fill them with interpolated values (2). But be careful! Replacing a lot of data with filled or interpolated observations could seriously distort your dataset and lead to very wrong conclusions.
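One way to answer the "few or many, scattered or grouped" questions before choosing a method is to count the gaps and measure the longest run of consecutive NaNs. This is only a sketch on a hypothetical series s; the run-labelling trick is a common pandas idiom, not something from the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one isolated gap and one run of three gaps
s = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, np.nan, np.nan, 8.0])

is_na = s.isna()
n_missing = int(is_na.sum())
share_missing = n_missing / len(s)

# Start a new run label every time the missing / not-missing state changes
run_id = (is_na != is_na.shift()).cumsum()

# Summing the boolean mask per run gives the length of each NaN run
# (runs of valid values sum to 0), so the max is the longest gap
longest_gap = int(is_na.groupby(run_id).sum().max())

print(n_missing, share_missing, longest_gap)
```

A small share of missing values with short gaps suggests dropna() is harmless; long gaps are where filling or interpolating starts to invent a noticeable part of the distribution.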

Here are some examples that use your snippet...

seaborn.distplot(data['alcconsumption'],hist=True,bins=100)
plt.xlabel('AlcoholConsumption')
plt.ylabel('Frequency(normalized 0->1)')

... on a synthetic dataset:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

def sample(rows, names):
    ''' Function to create a data sample of random integers

    Parameters
    ==========
    rows : number of rows in the dataframe
    names: list of column names

    Example
    =======

    >>> sample(rows = 2, names = ['A', 'B'])  # values are random

                 A   B
    2017-01-01  27  75
    2017-01-02 -50 -24
    '''
    # Daily DatetimeIndex starting on 2017-01-01
    rng = pd.date_range('1/1/2017', periods=rows, freq='D')
    # Random integers in [-100, 100), one column per name
    df_temp = pd.DataFrame(np.random.randint(-100,100,size=(rows, len(names))), columns=names, index=rng)
    return df_temp

df = sample(rows = 15, names = ['A', 'B'])
# Cast to float so the column can hold NaN, then blank out four consecutive rows
# (positional .iloc avoids chained assignment on df['A'][8:12])
df['A'] = df['A'].astype(float)
df.iloc[8:12, df.columns.get_loc('A')] = np.nan
df

Output:

            A   B
2017-01-01 -63.0  10
2017-01-02  49.0  79
2017-01-03 -55.0  59
2017-01-04  89.0  34
2017-01-05 -13.0 -80
2017-01-06  36.0  90
2017-01-07 -41.0  86
2017-01-08  10.0 -81
2017-01-09   NaN -61
2017-01-10   NaN -80
2017-01-11   NaN -39
2017-01-12   NaN  24
2017-01-13 -73.0 -25
2017-01-14 -40.0  86
2017-01-15  97.0  60

(1) Forward fill with pandas.DataFrame.fillna(method='ffill')

ffill will "fill values forward", meaning it will replace each NaN with the value of the row above.

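# Store the result under a new name so df still has its 'A' column for example (2);
# Series.ffill() is equivalent to fillna(method='ffill')
filled = df['A'].ffill()
sns.distplot(filled, hist=True,bins=5)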
plt.xlabel('AlcoholConsumption')
plt.ylabel('Frequency(normalized 0->1)')
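To see what forward filling does to the values themselves, independent of the plot, here is a small sketch on a hypothetical series. Series.ffill() is the shorthand for fillna(method='ffill').

```python
import numpy as np
import pandas as pd

# Hypothetical series with a two-value gap and a trailing gap
s = pd.Series([5.0, np.nan, np.nan, 2.0, np.nan])

# Each NaN takes the last valid value seen above it
filled = s.ffill()
print(filled.tolist())

# A NaN at the very start has nothing above it, so it stays missing
leading = pd.Series([np.nan, 1.0]).ffill()
print(leading.isna().tolist())
```

Every filled gap repeats an existing value, so a long gap piles extra counts into a single histogram bin. That is one reason the filled distribution can look quite different from the observed one.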

(2) Interpolation with pandas.DataFrame.interpolate()

Interpolate values according to different methods. Time interpolation works on daily and higher resolution data and weights each interpolated value by the length of the time interval between observations.

df['A'] = df['A'].interpolate(method = 'time')
sns.distplot(df['A'], hist=True,bins=5)
plt.xlabel('AlcoholConsumption')
plt.ylabel('Frequency(normalized 0->1)')
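The difference between method='time' and the default method='linear' only shows up when the timestamps are unevenly spaced. The sketch below uses a hypothetical three-point series with a missing day in the index to illustrate this; on a regular daily index like the one above, both methods give the same values.

```python
import numpy as np
import pandas as pd

# Hypothetical series: observations on Jan 1 and Jan 4, a gap on Jan 2,
# and no row at all for Jan 3
idx = pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-04'])
s = pd.Series([0.0, np.nan, 3.0], index=idx)

# 'linear' ignores the index and treats the rows as equally spaced
linear = s.interpolate(method='linear')

# 'time' uses the actual timestamps: Jan 2 is one third of the way to Jan 4
timed = s.interpolate(method='time')

print(linear.iloc[1], timed.iloc[1])
```

Note that method='time' requires a DatetimeIndex, which the synthetic dataframe in this answer has.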

As you can see, the two methods render very different results. I hope this will be useful to you. If not, let me know and I'll have another look at it.
