Create a new sheet with the sum of a specific column for each file in a directory of multiple xlsx files


Problem description

I have many Excel files in a directory, all with the same structure. For example, the data below could be test1.xlsx:

Date      Type     Name      Task       Subtask       Hours
3/20/16   Type1    Name1     TaskXyz    SubtaskXYZ    1.00  
3/20/16   Type1    Name2     TaskXyz    SubtaskXYZ    2.00  
3/20/16   Type1    Name3     TaskXyz    SubtaskXYZ    1.00  

What I would like to do is create a new Excel file that lists each file name in the directory together with the sum of its hours, like this:

File Name     Sum of hours
test1.xlsx    4
test2.xlsx    10
...           ...

I just started playing around with glob, which has been helpful for creating one large dataframe like this:

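import glob
import pandas as pd

# DataFrame.append was removed in pandas 2.0; collect the frames and concatenate once
frames = []
for f in glob.glob("path/*.xlsx"):
    frames.append(pd.read_excel(f, skiprows=4, index_col=None, na_values=['NA']))
all_data = pd.concat(frames, ignore_index=True)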

This has been helpful for building a single dataframe of all the data, regardless of which sheet each row came from, and I have been able to use groupby to analyze the data at a macro level. As far as I know, though, I cannot sum by the sheet each row came from; I can only do things like:

task_output = all_data.groupby(["Task", "Subtask"])["Hours"].agg(["sum", "mean"])

That gives me the sum and mean over the whole dataframe, but not for each individual sheet.

Any ideas on where to start with this?

Recommended answer

While reading each file into memory, you should remember the name of the file you are currently processing:

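import glob
import pandas as pd

frames = []
for f in glob.glob("path/*.xlsx"):
    df = pd.read_excel(f, skiprows=4, index_col=None, na_values=['NA'])
    df['filename'] = f  # tag every row with the file it came from
    frames.append(df)
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
all_data = pd.concat(frames, ignore_index=True)

task_output = all_data.groupby(['filename', "Task", "Subtask"])["Hours"].agg(["sum", "mean"])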
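
The grouping above also splits by Task and Subtask, so it does not directly yield the two-column table the question asks for (file name and total hours). The following is a minimal sketch of one way to get there, not part of the original answer. The helper names `summarize_hours` and `summarize_directory` and the `output_path` argument are hypothetical; the `read_excel` arguments are copied from the question, and writing with `to_excel` assumes an Excel engine such as openpyxl is installed.

```python
import glob
import os

import pandas as pd


def summarize_hours(frames):
    """Return one row per file with the total of its 'Hours' column.

    `frames` maps a file name to the DataFrame read from that file
    (hypothetical helper, not from the original answer).
    """
    rows = [
        {"File Name": name, "Sum of hours": df["Hours"].sum()}
        for name, df in frames.items()
    ]
    return pd.DataFrame(rows, columns=["File Name", "Sum of hours"])


def summarize_directory(pattern, output_path):
    """Read every workbook matching `pattern` and write the summary to `output_path`."""
    frames = {
        # Key by base name so the summary shows "test1.xlsx", not the full path
        os.path.basename(f): pd.read_excel(
            f, skiprows=4, index_col=None, na_values=["NA"]
        )
        for f in glob.glob(pattern)
    }
    summary = summarize_hours(frames)
    summary.to_excel(output_path, index=False)
    return summary
```

A call might look like `summarize_directory("path/*.xlsx", "summary.xlsx")`. Writing the summary outside the scanned directory is probably the safer choice, since otherwise a second run would pick up the summary file itself.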
