Amplify values that are similar using Numpy or Scipy in Python


Question

I have a numpy array that is being plotted using Matplotlib. My issue is that many of the values are very similar, so when they are graphed the plot is nearly unreadable.

0,0,0,0,0,0,0,0,46.29821447,49.49781571,49.83072758,50.89081787,98.49113721,98.5522082,99.29547499,99.91765345,99.93779431,99.95351796,99.98066963,99.99294867,100

Notice how some of the values are clustered. My question is: is there any method to iterate over the numpy array, detect those close-knit clusters, and apply an amplification that separates them (excluding the zero values)? When I graph them in Matplotlib, this is the result:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21])
y = np.array([0,0,0,0,0,0,0,0,46.29821447,49.49781571,49.83072758,50.89081787,98.49113721,98.5522082,99.29547499,99.91765345,99.93779431,99.95351796,99.98066963,99.99294867,100])
# Override the x ticks with latency labels
my_xticks = ['<2.5 uS', '<5 uS', '<10 uS', '<20 uS', '<30 uS', '<40 uS', '<50 uS', '<60 uS', '<70 uS', '<80 uS', '<90 uS', '<100 uS', '<200 uS', '<250 uS', '<350 uS', '<500 uS', '<1 mS', '<2 mS', '<5 mS', '<10 mS', '<1 S']
plt.xticks(x, my_xticks)
# Place y ticks at the probability levels of interest
# (passing y as positions and my_yticks as labels would raise an error,
# since there are 21 data points but only 19 labels)
my_yticks = [0,20,40,60,80,90,95,98,99,99.7,99.9,99.97,99.99,99.997,99.999,99.9997,99.9999,99.99999,99.999999]
plt.yticks(my_yticks)
plt.plot(x, y, '-r')
plt.plot(x, y, '.')
plt.ylim(bottom=-5, top=105)
plt.grid(axis='y')
plt.xlabel('Latency')
plt.ylabel('Probability in %')
plt.title('Probability Distribution')
plt.show()

Above is my code. I guess what I'm looking for is something like a bucket-sort algorithm, where if certain values are within some amount x of each other, their values are adjusted so that when I graph the newly generated array, the points that were really close to each other are now spread apart and more readable. (A rough sketch of this idea follows.)
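Something in that spirit is sketched below: sort the non-zero values, start a new cluster wherever the gap to the previous value reaches a threshold, and stretch each cluster away from its own mean. This is a minimal sketch, not code from the question; the gap of 1.0 and the stretch factor of 5 are arbitrary illustrative choices, and the stretched values no longer represent real probabilities.

import numpy as np

def amplify_clusters(values, gap=1.0, stretch=5.0):
    """Spread apart clusters of close values so they are readable on a plot."""
    values = np.asarray(values, dtype=float)
    out = values.copy()
    # sort the indices of the non-zero values so consecutive gaps are meaningful
    order = np.argsort(values)
    order = order[values[order] > 0]
    sorted_vals = values[order]
    # a new cluster starts wherever the gap to the previous value reaches `gap`
    breaks = np.where(np.diff(sorted_vals) >= gap)[0] + 1
    for cluster in np.split(order, breaks):
        center = values[cluster].mean()
        # push each member away from the cluster mean by the stretch factor
        out[cluster] = center + (values[cluster] - center) * stretch
    return out

y = np.array([0, 0, 0, 0, 0, 0, 0, 0,
              46.29821447, 49.49781571, 49.83072758, 50.89081787,
              98.49113721, 98.5522082, 99.29547499, 99.91765345,
              99.93779431, 99.95351796, 99.98066963, 99.99294867, 100])
print(amplify_clusters(y))

Note that this trades accuracy for readability: once the values have been moved, the y axis no longer reflects the true probabilities.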

Update

I have updated my code a bit, getting the same graph as above but composed of 15 separate subplots stitched together.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter

x = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21])
# Need a function that detects values that are similar up to the first 2 digits
# (49.x 49.x 99.x 99.x), takes their min and max, and assigns them to ylim dynamically
y = np.array([0,0,0,0,0,0,0,0,46.29821447,49.49781571,49.83072758,50.89081787,98.49113721,98.5522082,99.29547499,99.91765345,99.93779431,99.95351796,99.98066963,99.99294867,100])
# Override x ticks with latency labels
my_xticks = ['<2.5 uS', '<5 uS', '<10 uS', '<20 uS', '<30 uS', '<40 uS', '<50 uS', '<60 uS', '<70 uS', '<80 uS', '<90 uS', '<100 uS', '<200 uS', '<250 uS', '<350 uS', '<500 uS', '<1 mS', '<2 mS', '<5 mS', '<10 mS', '<1 S']
f, (ax,ax2,ax3,ax4,ax5,ax6,ax7,ax8,ax9,ax10,ax11,ax12,ax13,ax14,ax15) = plt.subplots(15, 1, sharex=True)
# Tuple of subplots to iterate over when assigning shared properties
plotArray = (ax,ax2,ax3,ax4,ax5,ax6,ax7,ax8,ax9,ax10,ax11,ax12,ax13,ax14,ax15)
# Format y tick values to show up to 7 decimal places: 99.xxxxxxx
majorFormatter = FormatStrFormatter('%.7f')
# Set the vertical spacing between subplots to 0 to stitch them together (no space)
plt.subplots_adjust(hspace=0)
# Override x tick labels with the custom latency labels
plt.xticks(x, my_xticks)
# Loop over the 15 subplots and assign their properties
for sub in plotArray:
    # y-axis grid lines
    sub.grid(axis='y')
    # red line
    sub.plot(x, y, '-r')
    # a point for each value
    sub.plot(x, y, '.')
    # override y ticks to show a tick at each data point
    sub.set_yticks(y)
    # apply the 7-decimal-place format to the y axis
    sub.yaxis.set_major_formatter(majorFormatter)
    # first subplot: keep the top spine and put x ticks on top
    if sub is plotArray[0]:
        sub.spines['top'].set_visible(True)
        sub.tick_params(axis='x', which='both', bottom=False, top=True, labelbottom=False)
    # last subplot: keep the bottom spine and the x tick labels
    elif sub is plotArray[-1]:
        sub.tick_params(axis='x', which='both', bottom=True, top=False, labelbottom=True)
        sub.spines['bottom'].set_visible(True)
        sub.spines['top'].set_visible(False)
    # in-between subplots: hide both spines and all x ticks
    else:
        sub.spines['bottom'].set_visible(False)
        sub.spines['top'].set_visible(False)
        sub.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)

# These limits should be assigned dynamically, based on the clusters of values
# that would otherwise be graphed on top of each other
ax.set_ylim(99.95, 100)
ax2.set_ylim(99.8, 99.95)
ax3.set_ylim(99.5, 99.8)
ax4.set_ylim(99, 99.5)
ax5.set_ylim(98.5, 99)
ax6.set_ylim(98, 98.5)
ax7.set_ylim(90, 98)
ax8.set_ylim(86, 90)
ax9.set_ylim(70, 86)
ax10.set_ylim(60, 70)
ax11.set_ylim(50, 60)
ax12.set_ylim(45, 50)
ax13.set_ylim(40, 45)
ax14.set_ylim(30, 40)
ax15.set_ylim(0, 30)

plt.show()

I need to be able to go over the array of percentages, which will vary:

0,0,0,0,0,0,0,0,46.29821447,49.49781571,49.83072758,50.89081787,98.49113721,98.5522082,99.29547499,99.91765345,99.93779431,99.95351796,99.98066963,99.99294867,100

in order to assign dynamic y-axis limits to the graph, ensuring that the data points in my array are displayed properly in each subplot. Concretely, I need to:

  1. Iterate over the array and find the values that are very close to each other, i.e. 49.x 49.x 98.x 98.x 99.x 99.x.
  2. Capture those numbers and, for each set, compute the smallest and largest value. For example, if one set contains the 4 values 99.9995, 99.99, 99.9994 and 99.993394, it would output (99.99, 99.9995) for that set, which I could then assign as the y-axis limits of one of the 15 subplots, so that the points it captures are spread apart and readable on the graph. (A sketch of this follows below.)
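A minimal sketch of steps 1 and 2, assuming a simple gap threshold between sorted values rather than a literal comparison of the first two digits; the gap of 0.5 and the padding are illustrative choices:

import numpy as np

def cluster_limits(values, gap=0.5, pad=0.1):
    """Return one (ymin, ymax) pair per cluster of close values, highest
    cluster first, ready to feed into set_ylim on the stacked subplots."""
    vals = np.sort(np.asarray(values, dtype=float))
    vals = vals[vals > 0]  # exclude the zero values
    # a new cluster starts wherever the gap to the previous value reaches `gap`
    breaks = np.where(np.diff(vals) >= gap)[0] + 1
    limits = []
    for cluster in np.split(vals, breaks):
        span = max(cluster.max() - cluster.min(), pad)
        limits.append((cluster.min() - pad * span, cluster.max() + pad * span))
    # the top subplot shows the highest values, so return the highest cluster first
    return limits[::-1]

y = np.array([0, 0, 0, 0, 0, 0, 0, 0,
              46.29821447, 49.49781571, 49.83072758, 50.89081787,
              98.49113721, 98.5522082, 99.29547499, 99.91765345,
              99.93779431, 99.95351796, 99.98066963, 99.99294867, 100])
for ymin, ymax in cluster_limits(y):
    print('set_ylim(%.4f, %.4f)' % (ymin, ymax))

The number of subplots would then be derived from the data rather than hard-coded to 15, e.g. plt.subplots(len(cluster_limits(y)), 1, sharex=True).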

Answer

It's pretty much impossible to plot data like this in a way that accurately represents both the fine-scale differences between quantiles and the large-scale jumps. You can mess around with a discontinuous y-axis (a sketch follows below), but when you end up having to apply all kinds of non-linearities to fit the data onto your axes, the plot becomes very difficult to interpret.
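For reference, here is a minimal sketch of the discontinuous y-axis idea, following the usual two-subplot "broken axis" pattern from the Matplotlib gallery; the two y ranges are illustrative choices for this data:

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(1, 22)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0,
              46.29821447, 49.49781571, 49.83072758, 50.89081787,
              98.49113721, 98.5522082, 99.29547499, 99.91765345,
              99.93779431, 99.95351796, 99.98066963, 99.99294867, 100])

# two stacked axes showing two disjoint y ranges of the same data
fig, (top, bottom) = plt.subplots(2, 1, sharex=True)
for axes in (top, bottom):
    axes.plot(x, y, '-r')
    axes.plot(x, y, '.')
top.set_ylim(98, 100.5)    # zoom in on the cluster near 100
bottom.set_ylim(-5, 55)    # zoom in on the cluster near 50 and the zeros
# hide the spines between the two axes to suggest one broken axis
top.spines['bottom'].set_visible(False)
bottom.spines['top'].set_visible(False)
top.tick_params(axis='x', which='both', bottom=False, labelbottom=False)
plt.subplots_adjust(hspace=0.1)
plt.show()

Even with only two panels, the reader has to mentally re-join the axis, which is exactly the interpretability cost described above.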

Is there some very important reason why you have to plot the cumulative distribution function (CDF) rather than the probability density function (PDF)?

Here's what the PDF of your data actually looks like on semi-log axes:

import numpy as np
from matplotlib import pyplot as plt


x = np.array([2.5E-06, 5.0E-06, 1.0E-05, 2.0E-05, 3.0E-05, 4.0E-05, 5.0E-05,
              6.0E-05, 7.0E-05, 8.0E-05, 9.0E-05, 1.0E-04, 2.0E-04, 2.5E-04,
              3.5E-04, 5.0E-04, 1.0E-03, 2.0E-03, 5.0E-03, 1.0E-02, 1.0E+00])

y = np.array([ 0.        ,  0.        ,  0.        ,  0.        ,
               0.        ,  0.        ,  0.        ,  0.        ,
              46.29821447, 49.49781571, 49.83072758, 50.89081787,
              98.49113721, 98.5522082 , 99.29547499, 99.91765345,
              99.93779431, 99.95351796, 99.98066963, 99.99294867,  100.]) / 100.

# we can get a rough estimate the PDF from the derivative of the CDF using
# second-order central differences (it would be better to evaluate the PDF
# directly if you can)
dx = np.gradient(x)
dy = np.gradient(y)

fig, ax = plt.subplots(1, 1)
ax.set_xscale('log')
ax.fill_between(x, 0, (dy / dx), alpha=0.5)

ax.set_ylabel('Probability density')
ax.set_xlabel('S')
plt.show()

In my opinion the PDF gives a much clearer intuition of what's going on. Basically, you have a high probability density for values close to ~70 uS, a smaller peak around ~100 uS, and almost zero probability everywhere else.

As you can see, these peaks in the PDF are very sharp. That means that when you compute the CDF (the integral of the PDF), you end up with lots of quantiles that are very similar, followed by huge jumps corresponding to where most of the probability density sits.

The jumps in the CDF (corresponding to the peaks in the PDF) are probably the most salient features of the probability distribution to represent, since they reflect the values that you are most likely to sample.
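As a sanity check on that relationship, cumulatively integrating the finite-difference PDF estimate should approximately recover the quantiles we started from. A minimal sketch, continuing from the x, y, dx and dy defined in the code above, and assuming SciPy's cumulative_trapezoid (named cumtrapz in older SciPy releases):

import numpy as np
from scipy.integrate import cumulative_trapezoid

# integrate the PDF estimate back up; this should roughly reproduce the CDF
pdf = dy / dx
cdf_reconstructed = cumulative_trapezoid(pdf, x, initial=0)
# maximum deviation from the original quantiles (non-zero only because of
# finite-difference error around the sharp peaks)
print(np.abs(cdf_reconstructed - y).max())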
