How to normalize seaborn distplot?

Views: 89 · Published: 2020/10/22 19:26:53 · Tags: python, python-3.x, statistics, seaborn, distribution


Question

For reproducibility, I am sharing the dataset [here][1].

Here is what I am doing: from column 2, I read the current row and compare it with the value of the previous row. If it is greater, I keep comparing. If the current value is smaller than the previous row's value, I divide the current value (smaller) by the previous value (larger). Hence the following code:

This gives the following plots.

sns.distplot(quotient, hist=False, label=protname)

As we can see from the plots, Data-V has a quotient of 0.8 when quotient_times is less than 3, and the quotient remains 0.5 when quotient_times is greater than 3.

I want to normalize the values so that the y-axis values of the second plot lie between 0 and 1. How do we do that in Python?

Solution

Foreword

From what I understand, seaborn's distplot performs a kernel density estimate (KDE) by default. If you want a normalized distplot graph, it may be because you assume that the graph's y values should be bounded in [0; 1]. If so, a Stack Overflow question has already raised the issue of KDE estimators showing values above 1.

Quoting one answer:

A continuous pdf (probability density function) never says the value has to be less than 1; for a continuous random variable, the function p(x) is not the probability. You can refer to continuous random variables and their distributions.

Quoting the first comment, by importanceofbeingernest:

The integral over a pdf is 1. There is no contradiction to be seen here.

As far as I know, it is the CDF (cumulative distribution function) whose values are supposed to be in [0; 1].

Note: the continuous distributions that can be fitted are listed on the SciPy site and are available in the scipy.stats package.

You may also want to have a look at probability mass functions.

If you really want the same graph normalized, then you should gather the actual data points of the plotted function (Option 1), or the function definition (Option 2), normalize them yourself, and plot them again.
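To make the point about densities concrete, here is a minimal sketch that is not part of the original answer. It uses scipy.stats.norm with an arbitrarily chosen, narrow standard deviation of 0.05. It suggests that a PDF can take values well above 1 while still integrating to about 1, and that only the CDF stays within [0; 1].

```python
import numpy as np
from scipy.stats import norm

# Illustrative assumption: a normal distribution with a small standard deviation.
dist = norm(loc=0.5, scale=0.05)

x = np.linspace(0.0, 1.0, 2001)
pdf = dist.pdf(x)
cdf = dist.cdf(x)

dx = x[1] - x[0]
area = np.sum(pdf) * dx  # simple Riemann-sum approximation of the integral

print('max pdf value :', pdf.max())             # larger than 1
print('area under pdf:', area)                  # close to 1
print('cdf range     :', cdf.min(), cdf.max())  # stays within [0, 1]
```

The narrower the distribution, the taller the density has to be for the area to remain 1, which is why a KDE of tightly clustered data can show large y values.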

Option 1

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sys

print('System versions          : {}'.format(sys.version))
print('System versions          : {}'.format(sys.version_info))
print('Numpy versqion           : {}'.format(np.__version__))
print('matplotlib.pyplot version: {}'.format(matplotlib.__version__))
print('seaborn version          : {}'.format(sns.__version__))

protocols = {}

types = {"data_v": "data_v.csv"}

for protname, fname in types.items():
    col_time,col_window = np.loadtxt(fname,delimiter=',').T
    trailing_window = col_window[:-1] # "past" values at a given index
    leading_window  = col_window[1:]  # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }

    fig, (ax1, ax2) = plt.subplots(1,2, sharey=False, sharex=False)
    g = sns.distplot(quotient, hist=True, label=protname, ax=ax1, rug=True)
    ax1.set_title('basic distplot (kde=True)')
    # get distplot line points
    line = g.get_lines()[0]
    xd = line.get_xdata()
    yd = line.get_ydata()
    # https://stackoverflow.com/questions/29661574/normalize-numpy-array-columns-in-python
    def normalize(x):
        # np.ptp(x, 0) is max - min; the ndarray.ptp method was removed in NumPy 2.0
        return (x - x.min(0)) / np.ptp(x, 0)
    #normalize points
    yd2 = normalize(yd)
    # plot them in another graph
    ax2.plot(xd, yd2)
    ax2.set_title('basic distplot (kde=True)\nwith normalized y plot values')

    plt.show()
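The normalize helper used above performs min-max scaling. As a standalone sketch (the sample values and the guard against a constant array are illustrative additions, not part of the original answer), it can be written and checked like this:

```python
import numpy as np

def normalize(values):
    """Min-max scale an array so its smallest value maps to 0 and its largest to 1."""
    values = np.asarray(values, dtype=float)
    span = np.ptp(values)  # max - min
    if span == 0:
        raise ValueError("cannot normalize a constant array")
    return (values - values.min()) / span

# Hypothetical density values standing in for the y data of the KDE line.
yd_example = np.array([0.2, 4.9, 9.6, 2.55])
yd_scaled = normalize(yd_example)
print(yd_scaled)
```

Note that after this rescaling the curve is no longer a probability density: its area is no longer 1, only its shape is preserved.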

Option 2

Below, I tried to perform a KDE and normalize the resulting estimate. I'm not a stats expert, so the KDE usage might be wrong in some way. (It differs from seaborn's, as one can see in the screenshot, because seaborn does the job much better than I do. I only tried to mimic the KDE fitting with scipy. The result is not so bad, I guess.)

Screenshot:

Code:

import numpy as np
from scipy import stats
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sys

print('System versions          : {}'.format(sys.version))
print('System versions          : {}'.format(sys.version_info))
print('Numpy versqion           : {}'.format(np.__version__))
print('matplotlib.pyplot version: {}'.format(matplotlib.__version__))
print('seaborn version          : {}'.format(sns.__version__))

protocols = {}

types = {"data_v": "data_v.csv"}

for protname, fname in types.items():
    col_time,col_window = np.loadtxt(fname,delimiter=',').T
    trailing_window = col_window[:-1] # "past" values at a given index
    leading_window  = col_window[1:]  # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }

    fig, (ax1, ax2, ax3, ax4) = plt.subplots(1,4, sharey=False, sharex=False)
    diff=quotient_times
    ax1.plot(diff, quotient, ".", label=protname, color="blue")
    ax1.set_ylim(0, 1.0001)
    ax1.set_title(protname)
    ax1.set_xlabel("quotient_times")
    ax1.set_ylabel("quotient")
    ax1.legend()

    sns.distplot(quotient, hist=True, label=protname, ax=ax2, rug=True)
    ax2.set_title('basic distplot (kde=True)')

    # taken from seaborn's source code (utils.py and distributions.py)
    def seaborn_kde_support(data, bw, gridsize, cut, clip):
        if clip is None:
            clip = (-np.inf, np.inf)
        support_min = max(data.min() - bw * cut, clip[0])
        support_max = min(data.max() + bw * cut, clip[1])
        return np.linspace(support_min, support_max, gridsize)

    kde_estim = stats.gaussian_kde(quotient, bw_method='scott')

    # manual linearization of data
    #linearized = np.linspace(quotient.min(), quotient.max(), num=500)

    # or better: mimic seaborn's internal stuff
    bw = kde_estim.scotts_factor() * np.std(quotient)
    linearized = seaborn_kde_support(quotient, bw, 100, 3, None)

    # computes values of the estimated function on the estimated linearized inputs
    Z = kde_estim.evaluate(linearized)

    # https://stackoverflow.com/questions/29661574/normalize-numpy-array-columns-in-python
    def normalize(x):
        # np.ptp(x, 0) is max - min; the ndarray.ptp method was removed in NumPy 2.0
        return (x - x.min(0)) / np.ptp(x, 0)

    # normalize so it is between 0;1
    Z2 = normalize(Z)
    for name, func in {'min': np.min, 'max': np.max}.items():
        print('{}: source={}, normalized={}'.format(name, func(Z), func(Z2)))

    # the plot differs from seaborn's because the method applied is not exactly the same
    ax3.plot(linearized, Z, ".", label=protname, color="orange")
    ax3.set_title('Non-normalized gaussian kde values')

    # manual kde result with Y axis values normalized (between 0;1)
    ax4.plot(linearized, Z2, ".", label=protname, color="green")
    ax4.set_title('Normalized gaussian kde values')

    plt.show()

Output:

System versions          : 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)]
System versions          : sys.version_info(major=3, minor=7, micro=2, releaselevel='final', serial=0)
Numpy versqion           : 1.16.2
matplotlib.pyplot version: 3.0.2
seaborn version          : 0.9.0
min: source=0.0021601491646143518, normalized=0.0
max: source=9.67319154426489, normalized=1.0

Contrary to a comment, plotting:

[(x-min(quotient))/(max(quotient)-min(quotient)) for x in quotient]

does not change the behavior! It only rescales the source data used for the kernel density estimation; the shape of the curve remains the same.
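This claim can be checked numerically. The sketch below is an illustration on a made-up bimodal sample (a stand-in for quotient, not the original data): it fits scipy.stats.gaussian_kde to the raw sample and to its min-max rescaled version, then compares the two curves after normalizing their heights.

```python
import numpy as np
from scipy import stats

def normalize(values):
    return (values - values.min()) / np.ptp(values)

# Hypothetical bimodal sample standing in for `quotient`.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(0.5, 0.02, 60), rng.normal(0.8, 0.02, 40)])
sample_scaled = normalize(sample)

grid = np.linspace(sample.min(), sample.max(), 200)
grid_scaled = normalize(grid)  # the same grid, expressed in rescaled units

density = stats.gaussian_kde(sample)(grid)
density_scaled = stats.gaussian_kde(sample_scaled)(grid_scaled)

# The heights differ by a constant factor, but the normalized curves coincide.
same_shape = np.allclose(normalize(density), normalize(density_scaled))
print(same_shape)
```

Because the default bandwidth is proportional to the spread of the data, rescaling the data rescales the bandwidth with it, so only the axis units change.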

Quoting seaborn's distplot doc:

This function combines the matplotlib hist function (with automatic calculation of a good default bin size) with the seaborn kdeplot() and rugplot() functions. It can also fit scipy.stats distributions and plot the estimated PDF over the data.

By default:

kde : bool, optional set to True Whether to plot a gaussian kernel density estimate.

It uses kde by default. Quoting seaborn's kde doc:

Fit and plot a univariate or bivariate kernel density estimate.

Quoting SciPy's gaussian_kde doc:

Representation of a kernel-density estimate using Gaussian kernels.

Kernel density estimation is a way to estimate the probability density function (PDF) of a random variable in a non-parametric way. gaussian_kde works for both uni-variate and multi-variate data. It includes automatic bandwidth determination. The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be oversmoothed.
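To show what such an estimate computes, here is a minimal one-dimensional sketch written from the textbook definition (an average of Gaussian bumps centred on the data points) and compared against scipy.stats.gaussian_kde. The sample is made up, and the bandwidth formula is my reading of Scott's rule as scipy applies it to 1-D data, so treat it as an assumption rather than a description of seaborn's internals.

```python
import numpy as np
from scipy import stats

def manual_gaussian_kde(data, grid):
    """Average of Gaussian bumps centred on each data point (1-D only)."""
    n = data.size
    # Scott's rule for 1-D data: n**(-1/5) times the sample standard deviation (ddof=1).
    h = n ** (-1.0 / 5.0) * np.std(data, ddof=1)
    z = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(1)
data = rng.normal(0.6, 0.1, 50)  # hypothetical sample
grid = np.linspace(0.2, 1.0, 100)

manual = manual_gaussian_kde(data, grid)
reference = stats.gaussian_kde(data, bw_method='scott')(grid)
matches = np.allclose(manual, reference)
print(matches)
```

The division by n * h is what makes the estimate integrate to 1, and it is also why a small bandwidth h produces density values above 1.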

Note that I do believe your data are bimodal, as you mentioned yourself. They also look discrete. As far as I know, discrete distributions cannot be analyzed in the same way continuous ones are, and fitting may prove tricky.

Here is an example with various distributions:
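If the data really are discrete, one quantity that is bounded in [0; 1] is the empirical probability mass of each distinct value. The following sketch uses a made-up sample (not the original dataset) and plain NumPy:

```python
import numpy as np

# Hypothetical discrete-looking sample: only a few distinct values occur.
sample = np.array([0.5, 0.5, 0.5, 0.8, 0.8, 0.5, 0.75, 0.5, 0.8, 0.5])

values, counts = np.unique(sample, return_counts=True)
pmf = counts / counts.sum()  # relative frequency of each distinct value

for value, mass in zip(values, pmf):
    print('P(X = {}) ~ {:.2f}'.format(value, mass))
```

Unlike a density, these masses are actual probabilities, so each lies between 0 and 1 and together they sum to 1.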

import numpy as np
from scipy.stats import uniform, powerlaw, logistic
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sys

print('System versions          : {}'.format(sys.version))
print('System versions          : {}'.format(sys.version_info))
print('Numpy versqion           : {}'.format(np.__version__))
print('matplotlib.pyplot version: {}'.format(matplotlib.__version__))
print('seaborn version          : {}'.format(sns.__version__))

protocols = {}

types = {"data_v": "data_v.csv"}

for protname, fname in types.items():
    col_time,col_window = np.loadtxt(fname,delimiter=',').T
    trailing_window = col_window[:-1] # "past" values at a given index
    leading_window  = col_window[1:]  # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }
    fig, [(ax1, ax2, ax3), (ax4, ax5, ax6)] = plt.subplots(2,3, sharey=False, sharex=False)
    diff=quotient_times
    ax1.plot(diff, quotient, ".", label=protname, color="blue")
    ax1.set_ylim(0, 1.0001)
    ax1.set_title(protname)
    ax1.set_xlabel("quotient_times")
    ax1.set_ylabel("quotient")
    ax1.legend()
    quotient2 = [(x-min(quotient))/(max(quotient)-min(quotient)) for x in quotient]
    print(quotient2)
    sns.distplot(quotient, hist=True, label=protname, ax=ax2, rug=True)
    ax2.set_title('basic distplot (kde=True)')
    sns.distplot(quotient2, hist=True, label=protname, ax=ax3, rug=True)
    ax3.set_title('distplot of min-max normalized quotient')

    sns.distplot(quotient, hist=True, label=protname, ax=ax4, rug=True, kde=False, fit=uniform)
    ax4.set_title('uniform distplot')
    sns.distplot(quotient, hist=True, label=protname, ax=ax5, rug=True, kde=False, fit=powerlaw)
    ax5.set_title('powerlaw distplot')
    sns.distplot(quotient, hist=True, label=protname, ax=ax6, rug=True, kde=False, fit=logistic)
    ax6.set_title('logistic distplot')
    plt.show()

Output:

System versions          : 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)]
System versions          : sys.version_info(major=3, minor=7, micro=2, releaselevel='final', serial=0)
Numpy versqion           : 1.16.2
matplotlib.pyplot version: 3.0.2
seaborn version          : 0.9.0
[1.0, 0.05230125523012544, 0.0433775382360589, 0.024590765616971128, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.02836946874603772, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.03393500048652319, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.0037013196009011043, 0.0, 0.05230125523012544]

Screenshot:
