使用FFT分析Google趋势时间序列的季节性 [英] Analyzing seasonality of Google trend time series using FFT

查看:74
本文介绍了使用FFT分析Google趋势时间序列的季节性的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用快速傅立叶变换来评估Google趋势时间序列的幅度谱.如果您在此处中提供的数据中查看饮食"的数据,它将显示一个非常强烈的季节性模式:

I am trying to evaluate the amplitude spectrum of the Google trends time series using a fast Fourier transformation. If you look at the data for 'diet' in the data provided here it shows a very strong seasonal pattern:

我认为我可以使用FFT分析这种模式,大概应该在1年内出现一个峰值.

I thought I could analyze this pattern using a FFT, which presumably should have a strong peak for a period of 1 year.

但是,当我应用这样的FFT时(a_gtrend_ham是时间序列乘以汉明窗):

However when I apply a FFT like this (a_gtrend_ham being the time series multiplied with a Hamming window):

import matplotlib.pyplot as plt
import numpy as np
from numpy.fft import fft, fftshift
import pandas as pd

gtrend = pd.read_csv('multiTimeline.csv',index_col=0)

gtrend.index = pd.to_datetime(gtrend.index, format='%Y-%m')

# Sampling rate
fs = 12 #Points per year

a_gtrend_orig = gtrend['diet: (Worldwide)']
N_gtrend_orig = len(a_gtrend_orig)
length_gtrend_orig = N_gtrend_orig / fs
t_gtrend_orig = np.linspace(0, length_gtrend_orig, num = N_gtrend_orig, endpoint = False)

a_gtrend_sel = a_gtrend_orig.loc['2005-01-01 00:00:00':'2017-12-01 00:00:00']
N_gtrend = len(a_gtrend_sel)
length_gtrend = N_gtrend / fs
t_gtrend = np.linspace(0, length_gtrend, num = N_gtrend, endpoint = False)

a_gtrend_zero_mean = a_gtrend_sel - np.mean(a_gtrend_sel)
ham = np.hamming(len(a_gtrend_zero_mean))
a_gtrend_ham = a_gtrend_zero_mean * ham

N_gtrend = len(a_gtrend_ham)
ampl_gtrend = 1/N_gtrend * abs(fft(a_gtrend_ham))
mag_gtrend = fftshift(ampl_gtrend)
freq_gtrend = np.linspace(-0.5, 0.5, len(ampl_gtrend))
response_gtrend = 20 * np.log10(mag_gtrend)
response_gtrend = np.clip(response_gtrend, -100, 100)

我得到的幅度谱没有显示任何主峰:

My resulting amplitude spectrum does not show any dominant peak:

我对如何使用FFT获取数据序列频谱的误解在哪里?

Where is my misunderstanding of how to use the FFT to get the spectrum of the data series?

推荐答案

以下是我认为您要实现的目标的清晰实现.我将提供图形输出,并简要讨论其可能的含义.

Here is a clean implementation of what I think you are trying to accomplish. I include graphical output and a brief discussion of what it likely means.

首先,我们使用rfft(),因为数据是真实值.这样可以节省时间和精力(并减少错误率),否则会产生冗余负频率.然后,我们使用rfftfreq()生成频率列表(再次,无需手动编码它,并且使用api可以降低错误率).

First, we use the rfft() because the data is real valued. This saves time and effort (and reduces the bug rate) that otherwise follows from generating the redundant negative frequencies. And we use rfftfreq() to generate the frequency list (again, it is unnecessary to hand code it, and using the api reduces the bug rate).

对于您的数据,Tukey窗口比Hamming和类似的基于cos或sin的窗口函数更合适.还要注意,在乘以窗口函数之前,我们减去了中位数.中位数()是对基线的相当可靠的估计,当然要比均值()高得多.

For your data, the Tukey window is more appropriate than the Hamming and similar cos or sin based window functions. Notice also that we subtract the median before multiplying by the window function. The median() is a fairly robust estimate of the baseline, certainly more so than the mean().

在该图中,您可以看到数据从其初始值迅速下降,然后以低端结束. Hamming窗口和类似的窗口为此而对中间区域进行了太窄的采样,不必要地衰减了许多有用的数据.

In the graph you can see that the data falls quickly from its intitial value and then ends low. The Hamming and similar windows, sample the middle too narrowly for this and needlessly attenuate a lot of useful data.

对于FT图,我们跳过零频点(第一点),因为它仅包含基线,而省略它则为y轴提供了更方便的缩放比例.

For the FT graphs, we skip the zero frequency bin (the first point) since this only contains the baseline and omitting it provides a more convenient scaling for the y-axes.

您会在FT输出的图表中注意到一些高频分量. 我在下面添加了一个示例代码,说明了这些高频分量的可能来源.

You will notice some high frequency components in the graph of the FT output. I include a sample code below that illustrates a possible origin of those high frequency components.

好的,这是代码:

import matplotlib.pyplot as plt
import numpy as np
from numpy.fft import rfft, rfftfreq
from scipy.signal import tukey

from numpy.fft import fft, fftshift
import pandas as pd

gtrend = pd.read_csv('multiTimeline.csv',index_col=0,skiprows=2)
#print(gtrend)

gtrend.index = pd.to_datetime(gtrend.index, format='%Y-%m')
#print(gtrend.index)

a_gtrend_orig = gtrend['diet: (Worldwide)']
t_gtrend_orig = np.linspace( 0, len(a_gtrend_orig)/12, len(a_gtrend_orig), endpoint=False )

a_gtrend_windowed = (a_gtrend_orig-np.median( a_gtrend_orig ))*tukey( len(a_gtrend_orig) )

plt.subplot( 2, 1, 1 )
plt.plot( t_gtrend_orig, a_gtrend_orig, label='raw data'  )
plt.plot( t_gtrend_orig, a_gtrend_windowed, label='windowed data' )
plt.xlabel( 'years' )
plt.legend()

a_gtrend_psd = abs(rfft( a_gtrend_orig ))
a_gtrend_psdtukey = abs(rfft( a_gtrend_windowed ) )

# Notice that we assert the delta-time here,
# It would be better to get it from the data.
a_gtrend_freqs = rfftfreq( len(a_gtrend_orig), d = 1./12. )

# For the PSD graph, we skip the first two points, this brings us more into a useful scale
# those points represent the baseline (or mean), and are usually not relevant to the analysis
plt.subplot( 2, 1, 2 )
plt.plot( a_gtrend_freqs[1:], a_gtrend_psd[1:], label='psd raw data' )
plt.plot( a_gtrend_freqs[1:], a_gtrend_psdtukey[1:], label='windowed psd' )
plt.xlabel( 'frequency ($yr^{-1}$)' )
plt.legend()

plt.tight_layout()
plt.show()

这是图形显示的输出.在1/year和0.14处有很强的信号(恰好是1/14年的1/2),并且还有一些较高频率的信号,乍一看似乎很神秘.

And here is the output displayed graphically. There are strong signals at 1/year and at 0.14 (which happens to be 1/2 of 1/14 yrs), and there is a set of higher frequency signals that at first perusal might seem quite mysterious.

我们看到开窗功能实际上对于将数据带到基线非常有效,并且您发现通过应用开窗功能,FT中的相对信号强度并没有太大改变.

We see that the windowing function is actually quite effective in bringing the data to baseline and you see that the relative signal strengths in the FT are not altered very much by applying the window function.

如果您仔细查看数据,那么一年之内似乎会有一些重复的变化.如果这些事件以一定规律性发生,则可以预期它们会在FT中作为信号出现,实际上,FT中是否存在信号通常被用来区分信号和噪声.但是,正如将显示的那样,对于高频信号有更好的解释.

If you look at the data closely, there seems to be some repeated variations within the year. If those occur with some regularity, they can be expected to appear as signals in the FT, and indeed the presence or absence of signals in the FT is often used to distinguish between signal and noise. But as will be shown, there is a better explanation for the high frequency signals.

好的,现在这里是一个示例代码,它说明了产生这些高频分量的一种方法.在此代码中,我们创建一个单一的音调,然后创建与该音调具有相同频率的一组尖峰.然后我们对两个信号进行傅立叶变换,最后绘制原始数据和FT数据.

Okay, now here is a sample code that illustrates one way those high frequency components can be produced. In this code, we create a single tone, and then we create a set of spikes at the same frequency as the tone. Then we Fourier transform the two signals and finally, graph the raw and FT data.

import matplotlib.pyplot as plt
import numpy as np
from numpy.fft import rfft, rfftfreq

t = np.linspace( 0, 1, 1000. )

y = np.cos( 50*3.14*t )

y2 = [ 1. if 1.-v < 0.01 else 0. for v in y ]

plt.subplot( 2, 1, 1 )
plt.plot( t, y, label='tone' )
plt.plot( t, y2, label='spikes' )
plt.xlabel('time')

plt.subplot( 2, 1, 2 )
plt.plot( rfftfreq(len(y),d=1/100.), abs( rfft(y) ), label='tone' )
plt.plot( rfftfreq(len(y2),d=1/100.), abs( rfft(y2) ), label='spikes' )
plt.xlabel('frequency')
plt.legend()

plt.tight_layout()
plt.show()

好的,这是色调,尖峰的图形,然后是它们的傅立叶变换.请注意,尖峰产生的高频成分与我们的数据非常相似.

Okay, here are the graphs of the tone, and the spikes, and then their Fourier transforms. Notice that the spikes produce high frequency components that are very similar to those in our data.

换句话说,在与原始数据中信号的尖峰特性相关的短时间范围内,高频分量的起源很有可能.

In other words, the origin of the high frequency components is very likely in the short time scales associated with the spikey character of signals in the raw data.

这篇关于使用FFT分析Google趋势时间序列的季节性的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆