Box plot for continuous data in Python
I have a csv file with 2 columns:
col1: Timestamp data (yyyy-mm-dd hh:mm:ss.ms, 8 months of data)
col2: Heat data (continuous variable)
Since there are almost 50k records, I would like to partition col1 (the timestamp column) into months or weeks and then draw box plots of the heat data for each period.
I tried this in R, but it takes a long time. I need help doing it in Python. I think I need to use seaborn.boxplot.
Please guide.
Group by Frequency then plot groups
First, read your csv data into a pandas DataFrame
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
# assumes NO header line in csv
# placeholder path; the raw string keeps the backslashes from being read as escapes
df = pd.read_csv(r'\file\path', names=['time','temp'], parse_dates=[0])
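To check that timestamps in the question's yyyy-mm-dd hh:mm:ss.ms format are parsed as datetimes, here is a minimal sketch that reads a few made-up rows from an in-memory string instead of a file. The sample values and the csv_text name are assumptions for illustration only.

```python
import io

import pandas as pd

# Three made-up rows in the question's layout: timestamp, heat value (no header).
csv_text = (
    "2011-01-01 00:00:00.125,21.5\n"
    "2011-01-01 01:00:00.250,22.1\n"
    "2011-01-08 00:00:00.500,19.8\n"
)

# Same call as above, but reading from a string buffer instead of a path.
df = pd.read_csv(io.StringIO(csv_text), names=['time', 'temp'], parse_dates=[0])
df = df.set_index('time')

print(df.index.dtype)  # a datetime64 dtype if parsing succeeded
print(df)
```

If the index dtype comes back as object rather than datetime64, the timestamps were not parsed and the frequency-based grouping used later will not work.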
I will use some fake data, 30 days of hourly samples.
heat = np.random.random(24*30) * 100
# 'h' is the hourly alias; pandas versions before 2.2 spelled it 'H'
dates = pd.date_range('1/1/2011', periods=24*30, freq='h')
df = pd.DataFrame({'time':dates,'temp':heat})
Set the timestamps as the DataFrame's index
df = df.set_index('time')
Now group by the period you want, seven days for this example
gb = df.groupby(pd.Grouper(freq='7D'))
Now you can plot each group separately
for g, week in gb:
    #week.plot()
    week.boxplot()
    plt.title(f'Week Of {g.date()}')
    plt.show()
    plt.close()
I didn't realize you could do this, but the groupby object can also draw every group as a box on a single set of axes
ax = gb.boxplot(subplots=False)
plt.setp(ax.xaxis.get_ticklabels(),rotation=30)
plt.show()
plt.close()
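The question also asks about monthly partitions. As a minimal sketch, the same Grouper should work with a calendar-month frequency; this example uses the 'MS' (month start) alias and 90 days of made-up hourly data, and compares group sizes rather than plotting. The data and variable names here are assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# 90 days of made-up hourly samples (January through March 2011).
rng = np.random.default_rng(0)
dates = pd.date_range('2011-01-01', periods=24 * 90, freq='h')
df = pd.DataFrame({'temp': rng.random(len(dates)) * 100}, index=dates)

# Seven-day bins, as in the answer above.
weekly_sizes = df.groupby(pd.Grouper(freq='7D')).size()

# Calendar-month bins, labelled by the first day of each month.
monthly_sizes = df.groupby(pd.Grouper(freq='MS')).size()

print(weekly_sizes.head())
print(monthly_sizes)
```

Replacing .size() with .boxplot(subplots=False) on the monthly groupby should then give one box per month, in the same way as the weekly version.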
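For a longer record, here are 300 days of hourly samples.
heat = np.random.random(24*300) * 100
# 'h' is the hourly alias; pandas versions before 2.2 spelled it 'H'
dates = pd.date_range('1/1/2011', periods=24*300, freq='h')
df = pd.DataFrame({'time':dates,'temp':heat})
df = df.set_index('time')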
To partition the data into five time periods and then get weekly boxplots of each:
Determine the total timespan; divide it by five; create a frequency alias from the result; then groupby
dt = df.index[-1] - df.index[0]
dt = dt/5
# 's' is the seconds alias; pandas versions before 2.2 spelled it 'S'
alias = f'{dt.total_seconds()}s'
gb = df.groupby(pd.Grouper(freq=alias))
Each group is a DataFrame, so iterate over the groups; create weekly groups from each and boxplot them.
for g,d_frame in gb:
    gb_tmp = d_frame.groupby(pd.Grouper(freq='7D'))
    ax = gb_tmp.boxplot(subplots=False)
    plt.setp(ax.xaxis.get_ticklabels(),rotation=90)
    plt.show()
    plt.close()
There might be a better way to do this; if so I'll post it, or maybe someone will feel free to edit this. It looks like this could lead to the last group not having a full set of data. ...
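One way to check the concern about the last group is to count the rows that land in each period. This is a sketch with the same kind of made-up data as above; it truncates the span to whole seconds, which is an assumption that happens to be exact for hourly samples. Depending on where the bin edges fall, the final timestamp may end up in a small extra group of its own.

```python
import numpy as np
import pandas as pd

# 300 days of made-up hourly samples, as in the example above.
rng = np.random.default_rng(0)
dates = pd.date_range('2011-01-01', periods=24 * 300, freq='h')
df = pd.DataFrame({'temp': rng.random(len(dates)) * 100}, index=dates)

# One fifth of the total span, written as a whole-second frequency alias.
span = (df.index[-1] - df.index[0]) / 5
alias = f'{int(span.total_seconds())}s'

# Rows per period; a short or extra trailing group shows up here.
sizes = df.groupby(pd.Grouper(freq=alias)).size()
print(sizes)
```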
If you know that your data is sampled at regular intervals, you can just use slices to split it into equal-length pieces.
n = len(df) // 5
for tmp_df in (df[i:i+n] for i in range(0, len(df), n)):
    gb_tmp = tmp_df.groupby(pd.Grouper(freq='7D'))
    ax = gb_tmp.boxplot(subplots=False)
    plt.setp(ax.xaxis.get_ticklabels(),rotation=90)
    plt.show()
    plt.close()
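When the number of rows is not a multiple of five, the slicing above produces a short sixth piece. A possible alternative, sketched here under the assumption that nearly equal pieces are acceptable, is to split the row positions with numpy.array_split and slice with iloc. The row count of 7203 is made up to show the uneven case.

```python
import numpy as np
import pandas as pd

# Made-up hourly data whose length (7203) is not divisible by five.
rng = np.random.default_rng(0)
dates = pd.date_range('2011-01-01', periods=7203, freq='h')
df = pd.DataFrame({'temp': rng.random(len(dates)) * 100}, index=dates)

# Split the row positions into exactly five contiguous, nearly equal runs.
position_chunks = np.array_split(np.arange(len(df)), 5)
parts = [df.iloc[chunk[0]:chunk[-1] + 1] for chunk in position_chunks]

for part in parts:
    # Count the weekly bins inside this piece instead of plotting them.
    n_weeks = len(part.groupby(pd.Grouper(freq='7D')).size())
    print(part.index[0], part.index[-1], len(part), n_weeks)
```

Each element of parts can then go through the same weekly groupby and boxplot calls used in the loop above.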
Frequency aliases
pandas.read_csv()
pandas.Grouper()