Operations on columns across multiple files in Pandas


Problem description


    I am trying to perform some arithmetic operations in Python Pandas and merge the results into one of the files.

    Path_1: File_1.csv, File_2.csv, ....
    

    This path holds several files, each covering a later time interval than the last, with the following columns:

        File_1.csv    |  File_2.csv
        Nos,12:00:00  |  Nos,12:30:00
    
        123,1451         485,5464
        656,4544         456,4865
        853,5484         658,4584
    
    Path_2: Master_1.csv
    
    Nos,00:00:00
    123,2000
    485,1500
    656,1000
    853,2500
    456,4500
    658,5000
    

    I am trying to read the n .csv files from Path_1 and compare the col[1] header (the time series) of each file with the last time-series column, col[last], of Master_1.csv.

    If Master_1.csv does not have that time, it should create a new column named after the time series from the path_1 .csv file and fill in the values by col['Nos'], subtracting them from col[1] of Master_1.csv.

    If the column for that time from the path_1 file is already present, then look up col['Nos'] and replace the NAN with the subtracted value for that col['Nos'].

    i.e.

    Expected Output in Master_1.csv

    Nos,00:00:00,12:00:00,12:30:00
    123,2000,549,NAN
    485,1500,NAN,3964
    656,1000,3544,NAN
    853,2500,2984,NAN
    456,4500,NAN,365
    658,5000,NAN,-416
    

    I understand the arithmetic, but I am not able to loop over Nos and the time series. I have tried to put some code together and am trying to work out the looping. I need help with that. Thanks.

    import pandas as pd 
    import numpy as np
    
    path_1 = '/'
    path_2 = '/'
    
    df_1 = pd.read_csv(os.path_1('/.*csv'), Index=None, columns=['Nos', 'timeseries'] #times series is different in every file eg: 12:00, 12:30, 17:30 etc
    df_2 = pd.read_csv('master_1.csv', Index=None, columns=['Nos', '00:00:00']) #00:00:00 time series
    
    for Nos in df_1 and df_2:
        df_1['Nos'] = df_2['Nos']
        new_tseries = df_2['00:00:00'] - df_1['timeseries']
    
    merged.concat('master_1.csv', Index=None, columns=['Nos', '00:00:00', 'new_tseries'], axis=0) # new_timeseries is the dynamic time series that every .csv file will have from path_1
    

    Solution

    You can do it in three steps:

    1. Read your CSVs into a list of dataframes.
    2. Merge the dataframes together (equivalent to a SQL left join or an Excel VLOOKUP).
    3. Calculate your derived columns using a vectorized subtraction.
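To see what the left join in step 2 does to rows that have no match, here is a minimal in-memory sketch built from a few of the question's sample rows. The variable names `master`, `file_1`, `merged` and `diff` are made up for illustration, and the sign convention (file value minus master value) is an assumption.

```python
import pandas as pd

# A few rows from the question's Master_1.csv and File_1.csv
master = pd.DataFrame({"Nos": [123, 485, 656], "00:00:00": [2000, 1500, 1000]})
file_1 = pd.DataFrame({"Nos": [123, 656], "12:00:00": [1451, 4544]})

# Left join: every master row is kept; ids missing from file_1 get NaN
merged = pd.merge(master, file_1, on="Nos", how="left")

# Vectorized subtraction: NaN propagates for the unmatched id 485
diff = merged["12:00:00"].sub(merged["00:00:00"])
print(merged.assign(diff=diff))
```

Because the join is a left join, the row for 485 survives with NaN in the `12:00:00` column, which is the gap the question wants filled in by a later file.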

    Here's some code you could try:

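    # read the interval files into a list of dataframes
    # (path_1 is the directory defined in the question)
    import glob
    import pandas as pd

    L = []
    for fname in sorted(glob.glob(path_1 + '*.csv')):
        L.append(pd.read_csv(fname))

    # read the master dataframe, and merge in the other dataframes
    df_2 = pd.read_csv('master_1.csv')
    for df in L:
        df_2 = pd.merge(df_2, df, on='Nos', how='left')

    # for each merged time column, calculate the difference with the master column
    # (skip 'Nos' and the master column itself, and keep the result)
    time_cols = df_2.columns.difference(['Nos', '00:00:00'])
    df_2[time_cols] = df_2[time_cols].sub(df_2['00:00:00'], axis=0)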
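The question also asks for NaN gaps to be filled when a time column already exists in the master. The merge above does not cover that case, so the following is only a sketch of one way to handle it, not part of the original answer. The helper `update_master` and its parameters are hypothetical, it assumes the `Nos` values are unique within each frame, and it uses the same sign convention as the lambda above (file value minus master value); swap the operands if the opposite sign is wanted.

```python
import pandas as pd

def update_master(master, frames, base_col="00:00:00", key="Nos"):
    """Return a copy of `master` with one difference column per time column in `frames`."""
    out = master.set_index(key)
    for frame in frames:
        values = frame.set_index(key)
        for col in values.columns:
            # Subtraction aligns on Nos; reindex restores the master's row order
            # and leaves NaN for ids this file does not contain.
            diff = (values[col] - out[base_col]).reindex(out.index)
            if col in out.columns:
                # Column already exists: only fill its NaN gaps.
                out[col] = out[col].fillna(diff)
            else:
                out[col] = diff
    return out.reset_index()

# Sample data from the question
master = pd.DataFrame({"Nos": [123, 485, 656, 853, 456, 658],
                       "00:00:00": [2000, 1500, 1000, 2500, 4500, 5000]})
file_1 = pd.DataFrame({"Nos": [123, 656, 853], "12:00:00": [1451, 4544, 5484]})
file_2 = pd.DataFrame({"Nos": [485, 456, 658], "12:30:00": [5464, 4865, 4584]})

result = update_master(master, [file_1, file_2])

# A later file (invented for this example) for a time that already has a column
late = pd.DataFrame({"Nos": [485], "12:00:00": [1600]})
result2 = update_master(result, [late])
print(result2)
```

To persist the result, something like `result2.to_csv('master_1.csv', index=False)` would write it back over the master file.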
    
