Reading all CSV files in a particular folder, merging them, and finding the maximum value w.r.t. row interval
Problem description
I have 120 CSV files. Each includes IndexNo, date, EArray, temperature, etc.
The IndexNo column ranges from 1 to 8760. I want to read all CSV files from a folder and merge them into a single file. Once these files are merged, every IndexNo will appear 120 times (i.e. IndexNo 1 will have 120 rows).
After this I have to find the maximum value of EArray for each IndexNo (i.e. IndexNo 1 to 8760) and print the row holding that maximum EArray value.
import glob
import os

import pandas as pd

path = r'C:\Data_Input'  # use your path
all_files = glob.glob(path + "/*.csv")
# print(all_files)

li = []
for filename in all_files:
    # Skip the first 10 lines of each file; the next line is used as the header.
    df = pd.read_csv(filename, skiprows=10, names=None, engine='python', header=0, encoding='unicode_escape')
    # Remember which file each row came from (file name up to the first dot).
    df = df.assign(File_name=os.path.basename(filename).split('.')[0])
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True, sort=False)
frame = frame.dropna()

# Attach the per-IndexNo maximum of EArray to every row, then keep the rows that reach it.
df = frame.assign(max_EArray=frame.groupby('IndexNo')['EArray'].transform('max'))
df_filtered = df[df['EArray'] == df['max_EArray']]
# Keep rows with no zero in any column, then drop the helper column.
output = df_filtered.loc[df_filtered.ne(0).all(axis=1)].drop('max_EArray', axis=1)
print(output.shape)
output.to_csv('temp.csv')
Recommended answer
Your task can be done quite easily using dask (instead of pure Pandas).
One advantage is that, out of the box, you can get the name of the source file from which each row was read.
My solution is as follows:
Install dask (if not yet installed).
Import dask.dataframe:
import dask.dataframe as dd
Define a function to reformat the DataFrame (called individually on each "partial" DataFrame read from a particular .csv file):
def reformat(df):
    df.path = df.path.str.extract(r'/(\w+)\.\w+')
    return df[['IndexNo', 'EArray', 'path']]
Here you can use "normal" Pandas code. The function also changes path, stripping the directory path and leaving only the file name (without extension).
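As a small illustration of the path clean-up described above, the sketch below applies the same regular expression to a plain Pandas Series. The two paths are made up for the example, and expand=False is used here only so that the result is a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical full paths, similar to what include_path_column=True adds.
paths = pd.Series([
    '/home/user/EArray/T1.csv',
    '/home/user/EArray/T2.csv',
])

# Same pattern as in reformat(): a slash, the file name, a dot, the extension.
names = paths.str.extract(r'/(\w+)\.\w+', expand=False)
print(names.tolist())
```

Note that the pattern expects forward slashes and file names made only of letters, digits and underscores; a name such as my-file.csv would not match and would come out as a missing value.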
Define a function to get the "max" row from each group (after grouping by IndexNo):
def getMax(grp):
    wrk = grp.reset_index(drop=True)
    ind = wrk.EArray.idxmax()
    return wrk.loc[ind, ['EArray', 'path']]
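To see what getMax does on a single group, here is a minimal Pandas-only sketch. The group is built by hand, and giving every row the index label 0 is an assumption meant to mimic rows that come from different files, each numbered from 0. That is one plausible reason for the reset_index(drop=True) call, though the answer does not state it.

```python
import pandas as pd

def getMax(grp):
    wrk = grp.reset_index(drop=True)
    ind = wrk.EArray.idxmax()
    return wrk.loc[ind, ['EArray', 'path']]

# One group (IndexNo == 1001) whose rows all carry the index label 0.
grp = pd.DataFrame(
    {'IndexNo': [1001, 1001, 1001],
     'EArray': [20, 22, 35],
     'path': ['T1', 'T2', 'T3']},
    index=[0, 0, 0],
)

# Without resetting the index, the label returned by idxmax() is ambiguous
# and .loc selects every row that shares it.
ambiguous = grp.loc[grp.EArray.idxmax()]
print(len(ambiguous))

# With the index reset inside getMax, exactly one row is selected.
row = getMax(grp)
print(row['EArray'], row['path'])
```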
Run the actual processing:
ddf = dd.read_csv('EArray/*.csv', include_path_column=True)
ddf = ddf.map_partitions(reformat)
ddf = ddf.groupby('IndexNo').apply(getMax, meta={'EArray': 'i4', 'path': 'O'})
df = ddf.compute().sort_index().reset_index()
Explanation:
- 'EArray/*.csv' - specification of the set of source files. I put all source files in a dedicated subfolder (EArray).
- include_path_column=True - adds a path column to the DataFrame, containing the full path of the file each row was read from.
- map_partitions(...) - calls the reformat function individually on each "partial" DataFrame.
- groupby(...) and apply(...) - generally, like in Pandas.
- meta - an additional argument required in dask (specification of the names and types of the columns in the output DataFrame).
- compute() - runs the processing tree prepared by the previous instructions. The result is now a "normal" Pandas DataFrame.
- sort_index() and reset_index() - Pandas operations on the result of compute().
For the test I prepared 3 .csv files, with 10 rows each:
T1.csv:
IndexNo date EArray
0 1001 2019-01-01 20
1 1002 2019-01-02 20
2 1003 2019-01-03 20
3 1004 2019-01-04 20
4 1005 2019-01-05 20
5 1006 2019-01-06 20
6 1007 2019-01-07 20
7 1008 2019-01-08 20
8 1009 2019-01-09 20
9 1010 2019-01-10 20
T2.csv:
IndexNo date EArray
0 1001 2019-01-11 22
1 1002 2019-01-12 23
2 1003 2019-01-13 24
3 1004 2019-01-14 25
4 1005 2019-01-15 26
5 1006 2019-01-16 27
6 1007 2019-01-17 28
7 1008 2019-01-18 29
8 1009 2019-01-19 30
9 1010 2019-01-20 31
T3.csv:
IndexNo date EArray
0 1001 2019-01-21 35
1 1002 2019-01-22 34
2 1003 2019-01-23 33
3 1004 2019-01-24 32
4 1005 2019-01-25 31
5 1006 2019-01-26 30
6 1007 2019-01-27 29
7 1008 2019-01-28 28
8 1009 2019-01-29 28
9 1010 2019-01-30 26
The result of my program is:
IndexNo EArray path
0 1001 35 T3
1 1002 34 T3
2 1003 33 T3
3 1004 32 T3
4 1005 31 T3
5 1006 30 T3
6 1007 29 T3
7 1008 29 T2
8 1009 30 T2
9 1010 31 T2
E.g. for IndexNo == 1001 the values of EArray are 20, 22 and 35, one for each input file.
The result for IndexNo == 1001 contains:
- EArray == 35 - the max value from the 3 above,
- T3 - the source file containing the "max" row.
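For a cross-check of these results without dask, the sketch below does the same job in plain Pandas. It is an alternative written for comparison, not the method of this answer. It writes three rows from each test file (IndexNo 1001, 1008 and 1010, without the date column) to a temporary folder, merges them, and selects the row with the largest EArray per IndexNo via groupby(...).idxmax(). The temporary folder and the variable names are assumptions made for the example.

```python
import glob
import os
import tempfile

import pandas as pd

# Three rows from each of the test files shown above (date column omitted).
samples = {
    'T1': [(1001, 20), (1008, 20), (1010, 20)],
    'T2': [(1001, 22), (1008, 29), (1010, 31)],
    'T3': [(1001, 35), (1008, 28), (1010, 26)],
}

with tempfile.TemporaryDirectory() as folder:
    for name, rows in samples.items():
        pd.DataFrame(rows, columns=['IndexNo', 'EArray']).to_csv(
            os.path.join(folder, name + '.csv'), index=False)

    frames = []
    for filename in sorted(glob.glob(os.path.join(folder, '*.csv'))):
        part = pd.read_csv(filename)
        # Keep the file name without extension, like the path column above.
        part['path'] = os.path.splitext(os.path.basename(filename))[0]
        frames.append(part)

merged = pd.concat(frames, ignore_index=True)

# idxmax() gives the row label of the largest EArray within each IndexNo group.
result = (merged.loc[merged.groupby('IndexNo')['EArray'].idxmax()]
          .sort_values('IndexNo')
          .reset_index(drop=True))
print(result)
```

If several files share the maximum for an IndexNo, idxmax() keeps only the first such row, whereas the transform('max') filter in the question keeps all of them.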
I'm aware that you will have to learn dask, but in my opinion it is worth the effort.
Note that my code is quite clear and concise: just 7 lines in the functions and 4 lines in the main program.