Split dataframes in groups and sub-groups and store the output in a CSV file


Problem description


Suppose I have a large dataframe like this:

A                     B      C
27/6/2017 4:00:00   928.04  4.83
27/6/2017 4:20:00   927.71  4.61
27/6/2017 4:40:00   928.22  4.49
27/6/2017 5:00:00   898.74  3.81
27/6/2017 5:20:00   895.16  3.55
27/6/2017 5:40:00   895.05  3.4
27/6/2017 6:00:00   895.68  3.3
27/6/2017 16:20:00  662.45  1.52
27/6/2017 16:40:00  639.98  1.48
27/6/2017 17:40:00  732.02  1.79
27/6/2017 18:00:00  722.63  1.98
27/6/2017 18:20:00  713.26  1.79
27/6/2017 18:40:00  705.8   1.54
27/6/2017 19:00:00  652.1   1.51
27/6/2017 19:20:00  638.58  1.68
27/6/2017 19:40:00  633.14  1.66
27/6/2017 20:00:00  654.66  1.45

I want to split the dataframe on the time difference between rows: wherever the gap between two consecutive timestamps is more than 4 hours, a new group should start. Then I want to split each of those groups into sub-groups based on the range that the value of B falls in. I want to store each group and sub-group in its own CSV file.

Desired output:

Group1:

A                     B      C
27/6/2017 4:00:00   928.04  4.83
27/6/2017 4:20:00   927.71  4.61
27/6/2017 4:40:00   928.22  4.49
27/6/2017 5:00:00   898.74  3.81
27/6/2017 5:20:00   895.16  3.55
27/6/2017 5:40:00   895.05  3.4
27/6/2017 6:00:00   895.68  3.3

Group2:

A                     B      C
27/6/2017 16:20:00  662.45  1.52
27/6/2017 16:40:00  639.98  1.48
27/6/2017 17:40:00  732.02  1.79
27/6/2017 18:00:00  722.63  1.98
27/6/2017 18:20:00  713.26  1.79
27/6/2017 18:40:00  705.8   1.54
27/6/2017 19:00:00  652.1   1.51
27/6/2017 19:20:00  638.58  1.68
27/6/2017 19:40:00  633.14  1.66
27/6/2017 20:00:00  654.66  1.45

Zones:

Group1 Zone1:

A                     B      C
27/6/2017 4:00:00   928.04  4.83
27/6/2017 4:20:00   927.71  4.61
27/6/2017 4:40:00   928.22  4.49

Group1 Zone2:

A                     B      C
27/6/2017 5:00:00   898.74  3.81
27/6/2017 5:20:00   895.16  3.55
27/6/2017 5:40:00   895.05  3.4
27/6/2017 6:00:00   895.68  3.3

Like this.

I have tried some logic to achieve this, but I was not able to get it working.

Code:

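from pandas import DataFrame, Series, concat
from scipy.stats import linregress

# df is the full dataframe, with a datetime column "Time" and a default integer index
dfs = df
time_diff = dfs["Time"].diff()

# Start a new zone when the gap to the previous row is at least 12 sampling intervals
zone = (dfs["Time"] >= (dfs["Time"].shift() + time_diff[1]*12)).cumsum()
zone_grp = dfs.groupby(zone)

# Keep only zones with at least 30 rows
zones = []
xyz = []
for k, g in zone_grp:
    if len(g) >= 30:
        zones.append(g)

for m in range(len(zones)):
    zone_df = DataFrame(zones[m])
    x = range(len(zone_df))
    y = zone_df["T401FN1VT4000"]

    # Fit a line to the zone; entries 0 and 1 of the result are slope and intercept
    abc = Series((linregress(x, y)))
    abc = DataFrame(abc).T
    slope = abc[0].tolist()
    intercept = abc[1].tolist()
    abc = DataFrame({"Slope": slope, "Intercept": intercept})
    xyz.append(abc)
    zone_df.to_csv("Zone_%s.csv" % m, index=False)

# Collect the per-zone coefficients into one table indexed by zone number
xyz = concat(xyz).reset_index()
del xyz["index"]
xyz["Zone"] = xyz.index
xyz = xyz.set_index("Zone")
xyz.to_csv("Coefficients.csv", index=True)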

Please help me split the dataframe by time difference in a better way, and store the groups and sub-groups in CSV files with different names.

Any help would be appreciated.

Solution

You could use diff and pd.Timedelta to build the first-level groupby key, and df.B // x * x to divide B into ranges of width x (x = 100 below). This assumes column A has already been parsed as datetimes.

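import pandas as pd

# Key 1: counter that increments at every gap of more than 4 hours
# Key 2: B rounded down to a multiple of 100
grps = [(df.A.diff() > pd.Timedelta(hours=4)).cumsum(), df.B // 100 * 100]
for i, g in df.groupby(grps):
    g.to_csv('{}_{}.csv'.format(*i))  # e.g. 0_800.0.csv
    print(g)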


                    A       B     C
3 2017-06-27 05:00:00  898.74  3.81
4 2017-06-27 05:20:00  895.16  3.55
5 2017-06-27 05:40:00  895.05  3.40
6 2017-06-27 06:00:00  895.68  3.30 

                    A       B     C
0 2017-06-27 04:00:00  928.04  4.83
1 2017-06-27 04:20:00  927.71  4.61
2 2017-06-27 04:40:00  928.22  4.49 

                     A       B     C
7  2017-06-27 16:20:00  662.45  1.52
8  2017-06-27 16:40:00  639.98  1.48
13 2017-06-27 19:00:00  652.10  1.51
14 2017-06-27 19:20:00  638.58  1.68
15 2017-06-27 19:40:00  633.14  1.66
16 2017-06-27 20:00:00  654.66  1.45 

                     A       B     C
9  2017-06-27 17:40:00  732.02  1.79
10 2017-06-27 18:00:00  722.63  1.98
11 2017-06-27 18:20:00  713.26  1.79
12 2017-06-27 18:40:00  705.80  1.54 
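
For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch of the same idea. The sample rows are the ones from the question, retyped as comma-separated text. The helper names `split_groups` and `write_groups`, the `Group{n}_Zone{n}.csv` file pattern, and numbering zones by order of first appearance (via `pd.factorize`) are my own assumptions, chosen to match the desired output in the question; they are not part of the answer above.

```python
import io
from pathlib import Path

import pandas as pd

# Sample rows from the question, retyped as CSV text
RAW = """A,B,C
27/6/2017 4:00:00,928.04,4.83
27/6/2017 4:20:00,927.71,4.61
27/6/2017 4:40:00,928.22,4.49
27/6/2017 5:00:00,898.74,3.81
27/6/2017 5:20:00,895.16,3.55
27/6/2017 5:40:00,895.05,3.4
27/6/2017 6:00:00,895.68,3.3
27/6/2017 16:20:00,662.45,1.52
27/6/2017 16:40:00,639.98,1.48
27/6/2017 17:40:00,732.02,1.79
27/6/2017 18:00:00,722.63,1.98
27/6/2017 18:20:00,713.26,1.79
27/6/2017 18:40:00,705.8,1.54
27/6/2017 19:00:00,652.1,1.51
27/6/2017 19:20:00,638.58,1.68
27/6/2017 19:40:00,633.14,1.66
27/6/2017 20:00:00,654.66,1.45
"""

df = pd.read_csv(io.StringIO(RAW))
# The timestamps are day-first, so parse them with an explicit format
df["A"] = pd.to_datetime(df["A"], format="%d/%m/%Y %H:%M:%S")


def split_groups(frame, gap=pd.Timedelta(hours=4), width=100):
    """Return {(group, zone): sub-frame}, with both numbers starting at 1."""
    # Group number goes up by one at every time gap larger than `gap`
    group = (frame["A"].diff() > gap).cumsum() + 1
    parts = {}
    for g, part in frame.groupby(group):
        # Bin B into ranges of `width`; number the bins in order of first appearance
        zone = pd.factorize(part["B"] // width * width)[0] + 1
        for z, sub in part.groupby(zone):
            parts[(int(g), int(z))] = sub
    return parts


def write_groups(parts, out_dir):
    """Write each sub-frame to Group{g}_Zone{z}.csv and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for (g, z), sub in parts.items():
        path = out / "Group{}_Zone{}.csv".format(g, z)
        sub.to_csv(path, index=False)
        paths.append(path)
    return paths


parts = split_groups(df)
for (g, z), sub in parts.items():
    print("Group{} Zone{}: {} rows".format(g, z, len(sub)))
```

Calling `write_groups(parts, "some_directory")` would then write one CSV per sub-group. Note that binning is by value range only, so in the second group the rows from 16:20 to 16:40 and from 19:00 to 20:00 land in the same zone, just as in the printed output above.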
