Trendline plotting not working with a big dataset

Problem description

I have a large dataset with 52,166 data points that looks like this:

                     bc_conc    
2010-04-09 10:00:00  609.542000          
2010-04-09 11:00:00  663.500000          
2010-04-09 12:00:00  524.661667         
2010-04-09 13:00:00  228.706667           
2010-04-09 14:00:00  279.721667         

It is a pandas DataFrame with a datetime index. I would like to plot bc_conc against time and add a trendline.

I used the following code:

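import matplotlib.dates
import matplotlib.pyplot as plt
import numpy as np

# Monthly means ('M' is the month-end alias; pandas >= 2.2 prefers 'ME')
data = data.resample('M', closed='left', label='left').mean()
x1 = data.index
# Convert datetimes to matplotlib's float day numbers so they can be fitted
x2 = matplotlib.dates.date2num(data.index.to_pydatetime())
y = data.bc_conc
z = np.polyfit(x2, y, 1)  # degree-1 least-squares fit: [slope, intercept]
p = np.poly1d(z)
fig = plt.figure()
ax1 = fig.add_subplot(1, 1, 1)
ax1.plot(x1, y, 'b-')      # data (plot_date is deprecated in recent matplotlib)
ax1.plot(x1, p(x2), 'ro')  # trendline evaluated at the same dates
plt.show()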

As you can see, I resampled my data. I did this because if I don't, the code just gives me a plot of the data without the trendline. If I resample to days, the plot still has no trendline. If I resample to months, a trendline shows.

It seems as if the code only works for a smaller dataset. Why is this? I was wondering if anyone could explain this to me, because I would like to resample my data to days, but no further.

Thanks in advance

Solution

This code seems to work fine, whether using hourly or daily resampled data.

Starting with 100,000 data points:

y = np.arange(0, 1000, .01) + np.random.normal(0, 100, 100000)
data = pd.DataFrame(data={'bc_conc': y}, index=pd.date_range(freq='H', start=datetime(2000, 1, 1), periods=len(y)))

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100000 entries, 2000-01-01 00:00:00 to 2011-05-29 15:00:00
Freq: H
Data columns (total 1 columns):
bc_conc    100000 non-null float64
dtypes: float64(1)

                        bc_conc
2000-01-01 00:00:00  -30.639811
2000-01-01 01:00:00  -26.791396
2000-01-01 02:00:00 -121.542718
2000-01-01 03:00:00  -69.267944
2000-01-01 04:00:00  117.731532
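For anyone who wants to rerun the setup above, here is a minimal self-contained sketch with the imports spelled out. The fixed seed and the use of `pd.Timedelta(hours=1)` in place of the `'H'` alias are assumptions added here, so the random values will not match the numbers printed above.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Fixed seed for repeatability (an assumption; the original used np.random.normal unseeded)
rng = np.random.default_rng(0)

# Upward ramp from 0 to 1000 plus Gaussian noise with standard deviation 100
ramp = np.arange(0, 1000, .01)
y = ramp + rng.normal(0, 100, ramp.size)

# Hourly index starting 2000-01-01; a Timedelta avoids depending on the 'H'/'h' alias spelling
index = pd.date_range(start=datetime(2000, 1, 1), periods=len(y),
                      freq=pd.Timedelta(hours=1))
data = pd.DataFrame({'bc_conc': y}, index=index)
```

`data.info()` and `data.head()` can then be used to inspect the frame as shown above.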

Calculation of trendline with optional resampling:

data = data.resample('D', closed='left', label='left').mean()  # optional: daily means
# Dates to floats: (fractional) days since matplotlib's epoch. The output below uses the
# older epoch (0001-01-01 00:00:00 UTC plus one); matplotlib >= 3.3 defaults to 1970-01-01.
x2 = matplotlib.dates.date2num(data.index.to_pydatetime())

[ 730120.  730121.  730122. ...,  734284.  734285.  734286.]
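To make the date conversion concrete, the small sketch below shows that `date2num` measures time in days, so half a day becomes 0.5 and consecutive days differ by 1.0. The sample timestamps are made up for illustration. The absolute values depend on the matplotlib epoch in use, so only differences are inspected here.

```python
from datetime import datetime

import matplotlib.dates as mdates

# Three illustrative timestamps: midnight, noon the same day, midnight the next day
stamps = [datetime(2000, 1, 1), datetime(2000, 1, 1, 12), datetime(2000, 1, 2)]
nums = mdates.date2num(stamps)

half_day = nums[1] - nums[0]  # noon is half a day after midnight
full_day = nums[2] - nums[0]  # one calendar day apart
print(half_day, full_day)
```

Because the unit is days, the slope returned by a degree-1 `np.polyfit` on these numbers reads as change in bc_conc per day.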

z = np.polyfit(x2, data.bc_conc, 1)

[  2.39988999e-01  -1.75220741e+05]  # coefficients

p = np.poly1d(z)

0.24 x - 1.752e+05 # fitted polynomial
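The large negative intercept above is a side effect of fitting against day numbers around 730,000: it is the value the line would take at day 0. As a hedged illustration, not part of the original answer, the sketch below fits a noiseless line with a known slope twice, once on raw day numbers and once on centered ones. The slope is the same either way, while the centered intercept becomes the mean level of the data. All values are hypothetical.

```python
import numpy as np

# One year of daily x-values in the same range as the date numbers shown above
x = np.arange(730120.0, 730120.0 + 365)
# Noiseless line with a known slope of 0.24 per day (hypothetical data)
y = 0.24 * (x - x[0]) + 5.0

# Fit on the raw day numbers: the intercept is extrapolated back to x = 0
slope, intercept = np.polyfit(x, y, 1)
trend = np.poly1d([slope, intercept])

# Fit on centered x: same slope, intercept equals the fitted value at the mean date
x_centered = x - x.mean()
slope_c, intercept_c = np.polyfit(x_centered, y, 1)

print(slope, intercept)
print(slope_c, intercept_c)
```

Centering is optional for a straight-line fit like this one, but it keeps the coefficients easier to interpret.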

data['trend'] = p(x2)  # trend from polynomial fit

              bc_conc     trend
2000-01-01 -29.794608  0.026983
2000-01-02   6.727729  0.266972
2000-01-03   9.815476  0.506961
2000-01-04 -27.954068  0.746950
2000-01-05 -13.726714  0.986939

data.plot()
plt.show()

Yields a plot of both columns, bc_conc and trend (image not preserved).
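The answer does not say why the asker's own data produced no trendline. One possibility, offered only as an assumption, is missing values: if the hourly series has gaps, daily resampling leaves NaN for days with no data, and `np.polyfit` typically returns NaN coefficients or raises an error when `y` contains NaN. Monthly means skip the missing hours, which would be consistent with the trendline appearing only after monthly resampling. The sketch below uses a made-up series with a ten-day gap and masks the NaN rows before fitting. It uses days since the first sample as the numeric axis instead of `date2num`, purely to keep the example free of matplotlib.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series over 60 days with a linear upward drift
idx = pd.date_range("2010-04-01", periods=24 * 60, freq=pd.Timedelta(hours=1))
values = np.linspace(100.0, 400.0, idx.size)
data = pd.DataFrame({"bc_conc": values}, index=idx)

# Drop ten whole days to simulate an instrument outage
gap = (data.index >= "2010-04-10") & (data.index < "2010-04-20")
data = data[~gap]

# Daily means: days without any rows come out as NaN
daily = data.resample("D").mean()
n_missing = int(daily["bc_conc"].isna().sum())

# Numeric time axis in days since the first sample
x = np.asarray((daily.index - daily.index[0]) / pd.Timedelta(days=1), dtype=float)
y = daily["bc_conc"].to_numpy()

# Fit only on finite points, then evaluate the trend on every day
mask = np.isfinite(y)
z = np.polyfit(x[mask], y[mask], 1)
daily["trend"] = np.poly1d(z)(x)

print(n_missing, z)
```

Checking `data.bc_conc.isna().sum()` after resampling is a quick way to see whether this applies to a given dataset.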
