Reading all CSV files in a particular folder, merging them, and finding the maximum value w.r.t. row interval


Problem description

I have 120 CSV files. Each one includes IndexNo, date, EArray, temperature, etc.

The index column (IndexNo) ranges from 1 to 8760. I want to read all CSV files from a folder and merge them into a single file. Once the files are merged, every IndexNo will appear 120 times (i.e. IndexNo 1 will have 120 rows).

After this I have to find the maximum value of EArray for each IndexNo (i.e. IndexNo 1 to 8760) and print the row that holds that maximum EArray value.

import glob
import os

import pandas as pd

path = r'C:\Data_Input'  # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))
# print(all_files)
li = []
for filename in all_files:
    df = pd.read_csv(filename, skiprows=10, names=None, engine='python', header=0, encoding='unicode_escape')
    # Remember which file each row came from (file name without extension)
    df = df.assign(File_name=os.path.basename(filename).split('.')[0])
    li.append(df)

# Concatenate once, after all files have been read
frame = pd.concat(li, axis=0, ignore_index=True, sort=False)

frame = frame.dropna()
# Attach the per-IndexNo maximum to every row, then keep only the rows that reach it
df = frame.assign(max_EArray=frame.groupby('IndexNo')['EArray'].transform('max'))
df_filtered = df[df['EArray'] == df['max_EArray']]
output = df_filtered.loc[df_filtered.ne(0).all(axis=1)].drop('max_EArray', axis=1)
print(output.shape)
output.to_csv('temp.csv')

Recommended answer

Your task can be done quite easily using dask (instead of pure Pandas).

One of its advantages is that "out of the box" you can get the name of the source file from which a particular row was read.

My solution is as follows:

  • Install dask (if it is not installed yet).

  • Import dask.dataframe:

    import dask.dataframe as dd

  • Define a function to reformat the DataFrame (called individually on each "partial" DataFrame read from a particular .csv file):

    def reformat(df):
        df.path = df.path.str.extract(r'/(\w+)\.\w+')
        return df[['IndexNo', 'EArray', 'path']]
    

    Here you can use "normal" Pandas code. The function also changes the path column, stripping the directory path and leaving only the file name (without extension).
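    To make the effect of that regular expression concrete, here is a minimal standard-library sketch that applies the same pattern to a few hypothetical paths. The helper names name_by_regex and name_by_stem and the sample paths are assumptions for illustration only. It also shows that the pattern takes the first "/name.ext" match, so a directory name containing a dot, or a file name containing characters outside \w, gives a different result than expected.

```python
import re
from pathlib import PurePosixPath

# Same pattern as used in reformat above
PATTERN = re.compile(r'/(\w+)\.\w+')


def name_by_regex(path):
    # Return the first "/name.ext" match, or None when nothing matches.
    m = PATTERN.search(path)
    return m.group(1) if m else None


def name_by_stem(path):
    # Alternative: take the last path component and drop its extension.
    return PurePosixPath(path).stem


print(name_by_regex('/home/user/EArray/T3.csv'))   # T3
print(name_by_regex('/home/my.data/T3.csv'))       # my (first match wins)
print(name_by_stem('/home/my.data/T3.csv'))        # T3
print(name_by_regex('/home/user/EArray/T-3.csv'))  # None ('-' is not in \w)
```

    If the real file names or folders may contain dots, hyphens or spaces, a stem-based approach like name_by_stem may be the safer choice.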

  • Define a function to get the "max" row from each group (after grouping by IndexNo):

    def getMax(grp):
        wrk = grp.reset_index(drop=True)
        ind = wrk.EArray.idxmax()
        return wrk.loc[ind, ['EArray', 'path']]
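
    As a rough illustration of what getMax returns, the sketch below runs the same function on a single hand-made group in plain Pandas (no dask involved). The sample frames grp and grp_tie are hypothetical. It also shows how ties behave: idxmax returns the first row that holds the maximum.

```python
import pandas as pd


def getMax(grp):
    # Renumber the rows from 0 so that idxmax yields a usable row label,
    # then return the EArray and path values of the row with the largest EArray.
    wrk = grp.reset_index(drop=True)
    ind = wrk.EArray.idxmax()
    return wrk.loc[ind, ['EArray', 'path']]


# Hypothetical group: all rows share one IndexNo
grp = pd.DataFrame({
    'IndexNo': [1001, 1001, 1001],
    'EArray': [20, 22, 35],
    'path': ['T1', 'T2', 'T3'],
})
row = getMax(grp)
print(row['EArray'], row['path'])

# With a tie, the first row holding the maximum is returned
grp_tie = pd.DataFrame({
    'IndexNo': [1, 1],
    'EArray': [5, 5],
    'path': ['A', 'B'],
})
tie_row = getMax(grp_tie)
print(tie_row['path'])
```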
    

  • Run the actual processing:

    ddf = dd.read_csv('EArray/*.csv', include_path_column=True)
    ddf = ddf.map_partitions(reformat)
    ddf = ddf.groupby('IndexNo').apply(getMax, meta={'EArray': 'i4', 'path': 'O'})
    df = ddf.compute().sort_index().reset_index()
    

  • Explanation:

    • 'EArray/*.csv' - specification of the set of source files. I put all source files in a dedicated subfolder (EArray).
    • include_path_column=True - adds a path column to the DataFrame, containing the full path of the file each row was read from.
    • map_partitions(...) - calls the reformat function individually on each "partial" DataFrame.
    • groupby(...) and apply(...) - generally, like in Pandas.
    • meta - an additional argument required in dask (specification of the names and types of the columns in the output DataFrame).
    • compute() - runs the processing tree prepared by the previous instructions. The result is now a "normal" Pandas DataFrame.
    • sort_index() and reset_index() - Pandas operations on the result of compute().

    For the test I prepared 3 .csv files, with 10 rows each:

    T1.csv:

       IndexNo        date  EArray
    0     1001  2019-01-01      20
    1     1002  2019-01-02      20
    2     1003  2019-01-03      20
    3     1004  2019-01-04      20
    4     1005  2019-01-05      20
    5     1006  2019-01-06      20
    6     1007  2019-01-07      20
    7     1008  2019-01-08      20
    8     1009  2019-01-09      20
    9     1010  2019-01-10      20
    

    T2.csv:

       IndexNo        date  EArray
    0     1001  2019-01-11      22
    1     1002  2019-01-12      23
    2     1003  2019-01-13      24
    3     1004  2019-01-14      25
    4     1005  2019-01-15      26
    5     1006  2019-01-16      27
    6     1007  2019-01-17      28
    7     1008  2019-01-18      29
    8     1009  2019-01-19      30
    9     1010  2019-01-20      31
    

    T3.csv:

       IndexNo        date  EArray
    0     1001  2019-01-21      35
    1     1002  2019-01-22      34
    2     1003  2019-01-23      33
    3     1004  2019-01-24      32
    4     1005  2019-01-25      31
    5     1006  2019-01-26      30
    6     1007  2019-01-27      29
    7     1008  2019-01-28      28
    8     1009  2019-01-29      28
    9     1010  2019-01-30      26
    

    The result of my program is:

       IndexNo  EArray path
    0     1001      35   T3
    1     1002      34   T3
    2     1003      33   T3
    3     1004      32   T3
    4     1005      31   T3
    5     1006      30   T3
    6     1007      29   T3
    7     1008      29   T2
    8     1009      30   T2
    9     1010      31   T2
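
    For readers who want to cross-check this table without installing dask, the following is a minimal pure-Pandas sketch of the same idea, not the method used in this answer. It writes the EArray values from T1, T2 and T3 above into a temporary folder (the date column is omitted for brevity), reads every .csv file back with glob, and keeps the row with the largest EArray per IndexNo. The variable names samples, merged and result are assumptions for illustration.

```python
import glob
import os
import tempfile

import pandas as pd

# EArray values per file for IndexNo 1001..1010, copied from the tables above
samples = {
    'T1': [20] * 10,
    'T2': [22, 23, 24, 25, 26, 27, 28, 29, 30, 31],
    'T3': [35, 34, 33, 32, 31, 30, 29, 28, 28, 26],
}

with tempfile.TemporaryDirectory() as folder:
    # Write the three sample files
    for name, values in samples.items():
        sample = pd.DataFrame({'IndexNo': range(1001, 1011), 'EArray': values})
        sample.to_csv(os.path.join(folder, name + '.csv'), index=False)

    # Read every CSV in the folder, tagging each row with its source file name
    frames = []
    for filename in sorted(glob.glob(os.path.join(folder, '*.csv'))):
        part = pd.read_csv(filename)
        part['path'] = os.path.splitext(os.path.basename(filename))[0]
        frames.append(part)

merged = pd.concat(frames, ignore_index=True)

# For each IndexNo, keep the row holding the largest EArray
result = (merged.loc[merged.groupby('IndexNo')['EArray'].idxmax()]
          .sort_values('IndexNo')
          .reset_index(drop=True))
print(result)
```

    With 120 files of 8760 rows each the merged frame should still fit in memory comfortably, so this variant may be enough when dask's lazy, partitioned processing is not needed.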
    

    E.g. for IndexNo == 1001 the values of EArray are 20, 22 and 35, one from each input file.

    The result for IndexNo == 1001 contains:

    • EArray == 35 - the max value of the 3 above,
    • T3 - the source file containing the "max" row.

    I'm aware that you will have to learn dask, but in my opinion it is worth putting in some effort to do so.

    Note that my code is quite clear and concise: just 7 lines in the functions and 4 lines in the main program.
