重新采样多个CSV文件并使用新名称自动保存重新采样的文件 [英] Resampling Multiple CSV Files and Automatically Saving Resampled Files with New Names

查看:213
本文介绍了重新采样多个CSV文件并使用新名称自动保存重新采样的文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

现在我已经努力了一个多星期来解决这个问题,似乎找不到一个解决方案。一些编程人员在帮助方面表现非常出色,但不幸的是没有人提供对我有用的建议。我将尝试尽可能简单地提出相同的问题。



我有很多(超过100个)csv文件。所有csv文件都有Datetime作为其第一列。 日期时间格式为YYYY-MM-DD HH:MM:SS。每个文件在整个月内每15分钟提供一行数据(很多数据行)。所有csv文件位于三个不同的文件夹中,每个文件夹都有以下路径:



C:\Users\Documents\SummaryData\24Hour p>

C:\Users\Documents\SummaryData\Daytime



C:\Users \Documents\SummaryData\Nighttime



24小时文件夹中的csv文件夹跨越24小时制。
白天文件夹中的csv文件跨度为06:00至18:00,适用于MM:SS。
Nightime文件夹中的csv文件跨度为MM:SS的18:00 - 06:00。



例如,一个csv文件存在2015年8月。在本月的24小时文件夹中,我们有一个csv文件,可在2015年8月的整个月内提供不间断的15分钟间隔数据。



对于同一个月和一年,在Daytime文件夹中存在另一个csv文件,该文件仅在06:00至18:00之间提供数据。例如,请参阅下面的文件的片断。我选择随机提供8月12日开始的数据。





进一步进入这个月,例如:





相同的文件存在于夜间,但跨越整个时间。



请注意,列出的列数多于上面的图像存在。



在保留这些原始的15分钟间隔文件的同时,我需要对所有csv文件重新采样,以便每个文件都有自己的每小时,每日和每月文件。这个棘手的部分是,我想让一些列在重采样时间框架中相加,而其他列需要在时间范围内进行平均。



所以如果我正在对当天的数据进行重新抽样,我需要一些列来平衡其数据,而其他数据则在一天之内达到数据。但是,尽管如此,我需要一个从这些原始的15分钟间隔csv文件创建的一个每日csv文件。在所有文件中,尽管具有相同标题名称的列需要相同的重采样(因此,如果列[windspeed]需要在一天内进行平均,则对于另一个csv文件中的列[windspeed]将是相同的)另一个棘手的部分是,我还需要将这些文件导出为csv文件(到任何输出位置,例如C:\Users\cp_vm\\ \\ Documents\Output),并自动重命名,以表示如何重新采样。



所以以2015年8月的csv文件为例,当前命名为:



2015August.csv,



如果我将此文件重新取样为小时,每日和每月,想要将所有这些新的重新采样的csv文件保存为:



2015AugustHourly.csv和;



2015AugustDaily.csv和



2015AugustMonthly.csv。



我知道我需要使用一些形式为'for循环',我真的试过了。但我不能想出这一个。任何帮助将不胜感激!感谢所有已经提供建议的人。



下面的输出示例显示了在几小时内平均的值:





下面的示例显示了一些额外的列(SR_Gen和SR_All),它们是15分钟的数据。



解决方案

我想你可以重复使用我们以前的工作中的代码( here )。使用原始代码,当创建 NIGHT DAY 数据框时,您可以每小时对其进行重新取样,并按月计算,并将新的(重新采样的)数据框保存为 .csv 文件,无论你喜欢。



我是将使用示例数据框(这里显示的前3行):

 日期PRP PRe Norm_Eff SR_Gen SR_All 
2016- 01-01 00:00:00 0.269389 0.517720 0.858603 8123.746453 8770.560467
2016-01-01 00:15:00 0.283316 0.553203 0.862253 7868.675481 8130.974409
2016-01-01 00:30:00 0.286590 0.693997 0.948463 8106.217144 8314.584848



全部代码



  import pandas as pd 
import datetime
from dateutil.relativedelta import relativedelta
from random import randint
import random
import calendar

#我用虚拟数据
start =定义了一个样本数据框datetime.datetime(2016,1,1,0,0)
r = range(0,10000)

dates = [start + relativedelta(minutes = 15 * i)for i in r ]
PRp = [random.uniform(.2,.3)for i in r]
PRe = [random.uniform(0.5,.7)for i in r]
Norm_Eff = [random.uniform(0.7,1)for i in r]
SR_Gen = [random.uniform(7500,8500)for i in r]
SR_All = [random.uniform(8000,9500)for我在r]

DF = pd.DataFrame({
'dates':dates,
'PRp':PRp,
'PRe':PRe,
'Norm_Eff':Norm_Eff,
'SR_Gen':SR_Gen,
'SR_All':SR_All,
})



#定义何时开始和结束(必须使用24个时钟)
day = {
'start':datetime.time(6,0),#从6am(6:00)开始
'end':datetime.time(18,0)#在下午6点(18:00)结束
}


#捕获出现在数据框中的年份
min_year = DF.dates.min()。year
max_year = DF.dates.max()。year

如果min_year == max_year:
yearRange = [min_year]
else:
yearRange = range(min_year,max_year + 1)

#每年迭代,每个月内每年
年份范围:
范围内的月份(1,13):

#过滤器显示给定月份中给定月份的NIGHT和DAY数据框
NIGHT = DF [(DF.dates> = datetime.datetime(年,月,1))&
(DF.dates< = datetime.datetime(年,月,1)+ relativedelta(months = 1) - relativedelta(days = 1))&
((DF.dates.apply(lambda x:x.time())< = day ['start'])|(DF.dates.apply(lambda x:x.time())> = day ['end'])]]

DAY = DF [(DF.dates> = datetime.datetime(year,month,1))&
(DF.dates< = datetime.datetime(年,月,1)+ relativedelta(months = 1) - relativedelta(days = 1))&
((DF.dates.apply(lambda x:x.time())> day ['start'])&(DF.dates.apply(lambda x:x.time())< day ['end'])]

#在小时,每日,每月基础上创建重新采样的数据帧
为resample_freq,freq_tag为zip(['H','D','M '],['Hourly','Daily','Monthly']):

NIGHT.index = NIGHT.dates#resampled列必须放在索引
NIGHT_R = pd.DataFrame (data = {
'PRp':NIGHT.PRp.resample(rule = resample_freq).mean(),#average data
'PRe':NIGHT.PRe.resample(rule = resample_freq).mean (),
'Norm_Eff':NIGHT.Norm_Eff.resample(rule = resample_freq).mean(),
'SR_Gen':NIGHT.SR_Gen.resample(rule = resample_freq).sum(),#求和数据
'SR_All':NIGHT.SR_All.resample(rule = resample_freq).sum()
})
NIGHT_R.dropna(inplace = True)#删除'day'期间的时间(显示为NA)

DAY.index = DAY.dates
DAY_R = pd .DataFrame(data = {
'PRp':DAY.PRp.resample(rule = resample_freq).mean(),
'PRe':DAY.PRe.resample(rule = resample_freq).mean ),
'Norm_Eff':DAY.Norm_Eff.resample(rule = resample_freq).mean(),
'SR_Gen':DAY.SR_Gen.resample(rule = resample_freq).sum(),
'SR_All':DAY.SR_All.resample(rule = resample_freq).sum()
})
DAY_R.dropna(inplace = True)#删除夜期间的时间(显示为NA)

#保存到.csv文件名中的日期和时间
#指定选择的保存路径
path_night ='C:\\Users\ \\\
ickb\\Desktop\\stack overflow \\ {0} {1} _NIGHT_ {2} .csv'.format(year,calendar.month_name [month],freq_tag)
path_day ='C:\\Users\\\\ nickb\\Desktop\\stackoverflow\\ {0} {1} _DAY_ {2} .csv'.format(year,calendar.month_name [month],freq_tag)

#上面的一些NIGHT_R / DAY_R过滤将不返回任何行。
#检查这个,只有数据框包含行
,如果NIGHT_R.shape [0]> 0:
NIGHT_R.to_csv(path_night,index = True)
如果DAY_R.shape [0]> 0:
DAY_R.to_csv(path_day,index = True)

以上将导致每个月共有SIX .csv 文件:


  1. 白天小时

  2. 白天每日基础

  3. 白天每月基数

  4. 夜间小时

  5. 夜间每日基础

  6. 夜间月度

每个文件将具有以下文件名:(年)(Month_Name)(日/夜)(频率)。例如: 2016August_NIGHT_Daily



让我知道上述是否达到目标。



此外,这里列出了您可以选择的 resample 频率列表:大熊猫resample文档


I've really tried hard now for over a week to solve this problem and I cannot seem to find a solution. Some coders have been excellent in helping but unfortunately no one is yet to provide suggestions that have worked for me. I'm going to try and ask the same question as simply as possible.

I have many (over 100) csv files. All csv files have "Datetime" as their first column. The "Datetime" format is in "YYYY-MM-DD HH:MM:SS". Each file provides rows of data every 15 minutes over an entire month (a lot of rows of data). All the csv files are located across three separate folders, each with the following paths:

"C:\Users\Documents\SummaryData\24Hour"

"C:\Users\Documents\SummaryData\Daytime"

"C:\Users\Documents\SummaryData\Nighttime"

The csv files in the 24 Hour folder span 24 hour time frames. The csv files in the Daytime folder span 06:00 - 18:00 for MM:SS. The csv files in the Nightime folder span 18:00 - 06:00 for MM:SS.

For example, a csv file exists for the month of August in 2015. For this month, in the 24 Hour folder, we have a csv file that provides non-stop 15-minute interval data for the entire month of August in 2015.

For the same month and year, another csv file exists, in the Daytime folder, that provides data just for the times from 06:00 - 18:00. For example, see below a snippet of the file. I chose at random to provide data starting on the 12th day of August.

And further into the month for example:

The same files exist for Nighttime as well but span time throughout the night.

Please note that a lot more columns than the ones shown in the above images exist.

While keeping these original 15-minute interval files, I need to resample all csv files so each one has its own Hourly, Daily and Monthly file. The tricky part, is that I want some of the columns to be summed over the resampling time frame, while other columns need to be averaged over the time frame.

So if I'm resampling the data for the day, I need some columns to average its data over the day, while others sum its data over the day. But nonetheless, I need one daily csv file that has been created from these original 15-minute interval csv files. Across all files though, columns with the same header name need the same resampling (so if column["windspeed"] needs to be averaged over the day, then that will be the same for column["windspeed"] in another csv file).

Another tricky part, is that I also need these files to be exported as csv files (to any output location, say "C:\Users\cp_vm\Documents\Output") and renamed automatically to signify how they have been resampled.

So taking the example of the csv file for August in 2015 that is currently named:

"2015August.csv",

if I resample this file to be hourly, daily and monthly, I want all these new resampled csv files to be saved as:

"2015AugustHourly.csv" and;

"2015AugustDaily.csv" and;

"2015AugustMonthly.csv" respectively.

I know I need to use some form of 'for loop' and I really have tried. But I cannot figure this one out. Any help would be greatly appreciated! And thanks to all those who have already provided advice.

Output example below showing the values being averaged over the hours:

The below example shows some of the additional columns (SR_Gen and SR_All) which are results of summing the 15-minute data over the hours.

解决方案

I think you can re-use the code from our previous work (here). Using the original code, when the NIGHT and DAY dataframes are created, you can then resample them on an hourly, daily, and monthly basis and save the new (resampled) dataframes as .csv files wherever you like.

I am going to use a sample dataframe (first 3 rows shown here):

dates               PRp         PRe         Norm_Eff    SR_Gen      SR_All
2016-01-01 00:00:00 0.269389    0.517720    0.858603    8123.746453 8770.560467
2016-01-01 00:15:00 0.283316    0.553203    0.862253    7868.675481 8130.974409
2016-01-01 00:30:00 0.286590    0.693997    0.948463    8106.217144 8314.584848

Full Code

import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta
from random import randint
import random
import calendar

# I defined a sample dataframe with dummy data
start = datetime.datetime(2016,1,1,0,0)
r = range(0,10000)

dates = [start + relativedelta(minutes=15*i) for i in r]
PRp = [random.uniform(.2, .3) for i in r]
PRe = [random.uniform(0.5, .7) for i in r]
Norm_Eff = [random.uniform(0.7, 1) for i in r]
SR_Gen = [random.uniform(7500, 8500) for i in r]
SR_All = [random.uniform(8000, 9500) for i in r]

DF = pd.DataFrame({
        'dates': dates,
        'PRp': PRp,
        'PRe': PRe,
        'Norm_Eff': Norm_Eff,
        'SR_Gen': SR_Gen,
        'SR_All': SR_All,
    })



# define when day starts and ends (MUST USE 24 CLOCK)
day = {
        'start': datetime.time(6,0),  # start at 6am (6:00)
        'end': datetime.time(18,0)  # ends at 6pm (18:00)
      }


# capture years that appear in dataframe
min_year = DF.dates.min().year
max_year = DF.dates.max().year

if min_year == max_year:
    yearRange = [min_year]
else:
    yearRange = range(min_year, max_year+1)

# iterate over each year and each month within each year
for year in yearRange:
    for month in range(1,13):

        # filter to show NIGHT and DAY dataframe for given month within given year
        NIGHT = DF[(DF.dates >= datetime.datetime(year, month, 1)) & 
           (DF.dates <= datetime.datetime(year, month, 1) + relativedelta(months=1) - relativedelta(days=1)) & 
           ((DF.dates.apply(lambda x: x.time()) <= day['start']) | (DF.dates.apply(lambda x: x.time()) >= day['end']))]

        DAY = DF[(DF.dates >= datetime.datetime(year, month, 1)) & 
           (DF.dates <= datetime.datetime(year, month, 1) + relativedelta(months=1) - relativedelta(days=1)) & 
           ((DF.dates.apply(lambda x: x.time()) > day['start']) & (DF.dates.apply(lambda x: x.time()) < day['end']))]

        # Create resampled dataframes on Hourly, Daily, Monthly basis
        for resample_freq, freq_tag in zip(['H','D','M'], ['Hourly','Daily','Monthly']):

            NIGHT.index = NIGHT.dates                           # resampled column must be placed in index
            NIGHT_R = pd.DataFrame(data={
                    'PRp': NIGHT.PRp.resample(rule=resample_freq).mean(),            # averaging data
                    'PRe': NIGHT.PRe.resample(rule=resample_freq).mean(),
                    'Norm_Eff': NIGHT.Norm_Eff.resample(rule=resample_freq).mean(),
                    'SR_Gen': NIGHT.SR_Gen.resample(rule=resample_freq).sum(),        # summing data
                    'SR_All': NIGHT.SR_All.resample(rule=resample_freq).sum()  
                })
            NIGHT_R.dropna(inplace=True)  # removes the times during 'day' (which show as NA)

            DAY.index = DAY.dates
            DAY_R = pd.DataFrame(data={
                    'PRp': DAY.PRp.resample(rule=resample_freq).mean(),
                    'PRe': DAY.PRe.resample(rule=resample_freq).mean(),
                    'Norm_Eff': DAY.Norm_Eff.resample(rule=resample_freq).mean(),
                    'SR_Gen': DAY.SR_Gen.resample(rule=resample_freq).sum(),        
                    'SR_All': DAY.SR_All.resample(rule=resample_freq).sum()  
                })
            DAY_R.dropna(inplace=True)  # removes the times during 'night' (which show as NA)

            # save to .csv with date and time in file name
            # specify the save path of your choice
            path_night = 'C:\\Users\\nickb\\Desktop\\stackoverflow\\{0}{1}_NIGHT_{2}.csv'.format(year, calendar.month_name[month], freq_tag)
            path_day = 'C:\\Users\\nickb\\Desktop\\stackoverflow\\{0}{1}_DAY_{2}.csv'.format(year, calendar.month_name[month], freq_tag)

            # some of the above NIGHT_R / DAY_R filtering will return no rows.
            # Check for this, and only save if the dataframe contains rows
            if NIGHT_R.shape[0] > 0:
                NIGHT_R.to_csv(path_night, index=True)
            if DAY_R.shape[0] > 0:
                DAY_R.to_csv(path_day, index=True)

The above will result in a total of SIX .csv files per month:

  1. Hourly basis for daytime
  2. Daily basis for daytime
  3. Monthly basis for daytime
  4. Hourly basis for nighttime
  5. Daily basis for nighttime
  6. Monthly basis for nighttime

Each file will have a file name as follows: (Year)(Month_Name)(Day/Night)(frequency). For example: 2016August_NIGHT_Daily

Let me know if the above achieves the goal or not.

Also, here is a list of available resample frequencies you can choose from: pandas resample documentation

这篇关于重新采样多个CSV文件并使用新名称自动保存重新采样的文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆