What does it mean to add Gaussian noise = 0.05 in scikit-learn make_circles()? How will it affect the data?

Question

I am working on hyperparameter tuning of neural networks and going through examples. I came across this code in one example:

train_X, train_Y = sklearn.datasets.make_circles(n_samples=300, noise=.05)

I understand that adding noise has a regularization effect on data. The documentation says that it adds Gaussian noise. However, in the code above, I could not understand what it means to add 0.05 noise to the data. How would this affect the data mathematically?

I tried the code below. I could see the values changing, but could not figure out how, for example, the row 1 values of x in array 1 were changed by adding noise=.05 to give the corresponding row in array 2, i.e. x_1 here.

import numpy as np
import sklearn.datasets

np.random.seed(0)
x, y = sklearn.datasets.make_circles()
print(x[:5, :])

x_1, y_1 = sklearn.datasets.make_circles(noise=.05)
print(x_1[:5, :])

Output:

[[-9.92114701e-01 -1.25333234e-01]
 [-1.49905052e-01 -7.85829801e-01]
 [ 9.68583161e-01  2.48689887e-01]
 [ 6.47213595e-01  4.70228202e-01]
 [-8.00000000e-01 -2.57299624e-16]]

[[-0.66187208  0.75151712]
 [-0.86331995 -0.56582111]
 [-0.19574479  0.7798686 ]
 [ 0.40634757 -0.78263011]
 [-0.7433193   0.26658851]]

Solution

According to the documentation:

sklearn.datasets.make_circles(n_samples=100, *, shuffle=True, noise=None, random_state=None, factor=0.8)
Make a large circle containing a smaller circle in 2d. A simple toy dataset to visualize clustering and classification algorithms.

noise: double or None (default=None) Standard deviation of Gaussian noise added to the data.

The call make_circles(noise=0.05) places points on two concentric circles and then shifts every x and y coordinate by a random amount drawn from a Gaussian distribution, also known as a normal distribution. A Gaussian distribution is described by a mean and a standard deviation. Here the mean is 0 and noise=0.05 sets the standard deviation to 0.05, so each coordinate typically moves by a few hundredths, and about 95% of the shifts fall within ±0.1.
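To see those shifts directly, here is a minimal sketch that generates the same circles with and without noise and subtracts one from the other. It assumes shuffle=False and a fixed random_state, neither of which appears in the question's code. With the default shuffle=True, each call returns the rows in a different random order, which is why row 1 of the two arrays in the question cannot be compared.

```python
import numpy as np
from sklearn.datasets import make_circles

# shuffle=False keeps the rows in the same order in both calls, so row i of
# `noisy` is row i of `clean` plus its own noise offset.
clean, _ = make_circles(n_samples=1000, shuffle=False, noise=None, random_state=0)
noisy, _ = make_circles(n_samples=1000, shuffle=False, noise=0.05, random_state=0)

# The Gaussian noise that was added: one (dx, dy) pair per point.
offsets = noisy - clean

print(offsets[:3])     # a few individual shifts
print(offsets.mean())  # close to 0, the mean of the noise
print(offsets.std())   # close to 0.05, the value passed as `noise`
```

The standard deviation of the offsets should land near 0.05, which is the mathematical meaning of the noise parameter.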

Let's invoke the function, check out its output, and see what's the effect of changing the parameter noise. I'll borrow liberally from this nice tutorial on generating scikit-learn dummy data.

Let's first call make_circles() with noise=0.0 and take a look at the data. I'll use a Pandas dataframe so we can see the data in a tabular way.

from sklearn.datasets import make_circles
import matplotlib.pyplot as plt
import pandas as pd

n_samples = 100
noise = 0.00

features, labels = make_circles(n_samples=n_samples, noise=noise)
df = pd.DataFrame(dict(x=features[:,0], y=features[:,1], label=labels))
print(df.head())
#           x         y  label
# 0 -0.050232  0.798421      1
# 1  0.968583  0.248690      0
# 2 -0.809017  0.587785      0
# 3 -0.535827  0.844328      0
# 4  0.425779 -0.904827      0

You can see that make_circles returns data instances where each instance is a point with two features, x and y, and a label. Let's plot them to see what they actually look like.

# Collect the points together by label, either 0 or 1
grouped = df.groupby('label')

colors = {0:'red', 1:'blue'}
fig, ax = plt.subplots(figsize=(7,7))
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', marker='.', label=key, color=colors[key])
plt.title('Points')
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.grid()
plt.show()

So it looks like it's creating two concentric circles, each with a different label.

Let's increase the noise to noise=0.05 and see the result:

n_samples = 100
noise = 0.05  # <--- The only change

features, labels = make_circles(n_samples=n_samples, noise=noise)
df = pd.DataFrame(dict(x=features[:,0], y=features[:,1], label=labels))

grouped = df.groupby('label')

colors = {0:'red', 1:'blue'}
fig, ax = plt.subplots(figsize=(7,7))
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', marker='.', label=key, color=colors[key])
plt.title('Points')
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.grid()
plt.show()

It looks like the noise is added to each of the x, y coordinates to make each point shift around a little bit. When we inspect the code for make_circles() we see that the implementation does exactly that:

def make_circles( ..., noise=None, ...):

    ...
    if noise is not None:
        X += generator.normal(scale=noise, size=X.shape)
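The same idea can be sketched by hand with NumPy alone. The helper below, circles_with_noise, is a hypothetical name for illustration and not part of scikit-learn. It skips the shuffling step and uses NumPy's default_rng, so it will not reproduce make_circles output number for number, but the noise step is the same as in the excerpt above.

```python
import numpy as np

def circles_with_noise(n_samples=100, noise=0.05, factor=0.8, seed=0):
    """Hypothetical helper: two concentric circles plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    n_out = n_samples // 2
    n_in = n_samples - n_out

    # Evenly spaced angles around each circle.
    t_out = np.linspace(0, 2 * np.pi, n_out, endpoint=False)
    t_in = np.linspace(0, 2 * np.pi, n_in, endpoint=False)

    outer = np.column_stack([np.cos(t_out), np.sin(t_out)])         # radius 1, label 0
    inner = factor * np.column_stack([np.cos(t_in), np.sin(t_in)])  # radius `factor`, label 1

    X = np.vstack([outer, inner])
    y = np.concatenate([np.zeros(n_out, dtype=int), np.ones(n_in, dtype=int)])

    if noise is not None:
        # Add an independent draw from N(0, noise**2) to every coordinate.
        X = X + rng.normal(scale=noise, size=X.shape)
    return X, y

X, y = circles_with_noise(n_samples=2000, noise=0.05)
radii = np.hypot(X[:, 0], X[:, 1])
print(radii[y == 0].mean(), radii[y == 1].mean())  # near 1.0 and 0.8
```

The points still sit on average at radius 1 and radius 0.8. The noise only scatters them around those two circles.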

So now we've seen two visualizations of the dataset with two values of noise. But two visualizations isn't cool. You know what's cool? Five visualizations with the noise increasing progressively by 10x. Here's a function that does it:

def make_circles_plot(n_samples, noise):

    assert n_samples > 0
    assert noise >= 0

    # Use make_circles() to generate random data points with noise.
    features, labels = make_circles(n_samples=n_samples, noise=noise)

    # Create a dataframe for later plotting.
    df = pd.DataFrame(dict(x=features[:,0], y=features[:,1], label=labels))
    grouped = df.groupby('label')
    colors = {0:'red', 1:'blue'}

    fig, ax = plt.subplots(figsize=(5, 5))

    for key, group in grouped:
        group.plot(ax=ax, kind='scatter', x='x', y='y', marker='.', label=key, color=colors[key])
    plt.title('Points with noise=%f' % noise)
    plt.xlim(-2, 2)
    plt.ylim(-2, 2)
    plt.grid()
    plt.tight_layout()
    plt.show()

Calling the above function with different values of noise shows clearly that increasing this value makes the points move around more, i.e. it makes them more "noisy", exactly as we would expect intuitively.

for noise in [0.0, 0.01, 0.1, 1.0, 10.0]:
    make_circles_plot(500, noise)
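The plots can be backed with a rough number. With the default factor=0.8 the two circles are 0.2 apart, so noise=0.05 is a quarter of that gap. The sketch below uses a rule chosen only for illustration (guess the label from whether a point lies inside or outside radius 0.9) and reports how often that guess is wrong at each noise level.

```python
import numpy as np
from sklearn.datasets import make_circles

errors = {}
for noise in [0.0, 0.01, 0.05, 0.1, 1.0]:
    X, y = make_circles(n_samples=2000, noise=noise, random_state=0)
    radius = np.hypot(X[:, 0], X[:, 1])

    # Label 0 is the outer circle (radius 1), label 1 the inner one (radius 0.8).
    # Illustrative rule: points inside the midpoint radius 0.9 are guessed as label 1.
    guess = (radius < 0.9).astype(int)
    errors[noise] = (guess != y).mean()

    print(f"noise={noise:<5} outer radius std={radius[y == 0].std():.3f} "
          f"misassigned={errors[noise]:.1%}")
```

As the noise grows toward and beyond the gap between the circles, the two classes overlap and the share of misassigned points rises.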
