Merge two rows in the same Dataframe if their index is the same?


Question

I have created a large DataFrame by pulling data from an Azure database. Constructing the DataFrame wasn't simple, as I had to do it in parts, using the concat function to add new columns to the data set as they were pulled from the database.

This worked fine, but I am indexing by entry date, and when concatenating I sometimes end up with two data rows that share the same index. Is it possible to merge rows with the same index? I have searched online for solutions, but I always come across examples that merge two separate DataFrames rather than rows within the same DataFrame.

                      Col1  Col2
2015-10-27 22:22:31   1400  
2015-10-27 22:22:31         50.5

into this:

                      Col1  Col2
2015-10-27 22:22:31   1400  50.5
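For duplicate index labels like the pair above, a vectorized alternative to a hand-written loop is to group on the index itself: `groupby(level=0).first()` takes the first non-null value per column within each group, which amounts to a merge whenever the duplicated rows never carry values in the same column (an assumption made in this sketch):

```python
import numpy as np
import pandas as pd

# Two rows sharing one timestamp, each carrying a different column
idx = pd.to_datetime(["2015-10-27 22:22:31", "2015-10-27 22:22:31"])
df = pd.DataFrame({"Col1": [1400.0, np.nan],
                   "Col2": [np.nan, 50.5]}, index=idx)

# first() returns the first non-null value per column within each index group
merged = df.groupby(level=0).first()
# merged now has a single row: Col1 = 1400.0, Col2 = 50.5
```

If the duplicated rows can overlap in the same column, swap `first()` for an explicit aggregation such as `mean()` or `sum()`.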

I have tried using the groupby function on the index, but that just made a mess of things: most of the data columns disappeared and a few very large numbers were spat out.

The data is in this sort of format, except with many more columns, and is generally quite sparse:

                        Col1    Col2    ...    Col_n-1 Col_n    
2015-10-27 21:15:60+0   1220        
2015-10-27 21:25:4+0    1420        
2015-10-27 21:28:8+0    1410        
2015-10-27 21:37:10+0           51.5    
2015-10-27 21:37:11+0   1500        
2015-10-27 21:46:14+0           51  
2015-10-27 21:46:15+0   1390        
2015-10-27 21:55:19+0   1370        
2015-10-27 22:04:24+0   1450        
2015-10-27 22:13:28+0   1350        
2015-10-27 22:22:31+0   1400        
2015-10-27 22:22:31+0           50.5
2015-10-27 22:25:33+0   1300        
2015-10-27 22:29:42+0                   ...    1900 
2015-10-27 22:29:42+0                                  63       
2015-10-27 22:34:36+0   1280        

Answer

For anyone interested - I ended up writing my own function to:

  1. Iterate through the DataFrame
  2. Record which rows need merging by noting their indices
  3. Sum or average the values across those rows
  4. Delete all but one row in each set to be merged, replacing its values with the aggregate or average (depending on what I needed)
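When only duplicated index values need merging (rather than wider time blocks), the four steps above collapse into a single groupby on the index; `mean()` here stands in for the "aggregate or average" step and is an assumption about the desired aggregation:

```python
import numpy as np
import pandas as pd

# Two rows share the first timestamp; a third row stands alone
idx = pd.to_datetime(["2015-10-27 22:22:31",
                      "2015-10-27 22:22:31",
                      "2015-10-27 22:25:33"])
df = pd.DataFrame({"Col1": [1400.0, np.nan, 1300.0],
                   "Col2": [np.nan, 50.5, np.nan]}, index=idx)

# One row per unique index label; NaNs are ignored when averaging
merged = df.groupby(level=0).mean()
# merged has 2 rows: (1400.0, 50.5) and (1300.0, NaN)
```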

Code:

import numpy as np
import pandas as pd

def groupDataOnTimeBlock(data, timeBlock_type, timeBlock_factor):
    '''
    Filter Dataframe to merge lines which are within the same time block,
    i.e. rows belonging to the same x number of seconds, weeks, months...

    data:
        Dataframe to filter. Must contain a 'Timestamp' column and use a
        0..n integer index.

    timeBlock_type:
        Time period with which to group data rows. This can be data per:
            SECONDS, MINUTES, MILLISECONDS, WEEKS

    timeBlock_factor:
        Number of timeBlock types to group on.

    Relies on timeStampReformat(), a separate helper that rounds a
    timestamp down to the start of its time block.
    '''

    pd.options.mode.chained_assignment = None  # default='warn'

    tBt = timeBlock_type.upper()
    tBf = timeBlock_factor

    if tBt in ('SEC', 'SECOND', 'SECONDS'):
        roundType = 'SECONDS'
    elif tBt in ('MIN', 'MINS', 'MINUTES'):
        roundType = 'MINUTES'
    elif tBt in ('MILLI', 'MILLISECONDS'):
        roundType = 'MILLISECONDS'
    elif tBt in ('WEEK', 'WEEKS'):
        roundType = 'WEEKS'
    else:
        raise ValueError('Invalid time block type entered')

    numElements = len(data.columns)
    anchorValue = timeStampReformat(data.iloc[1, len(data.columns) - 7], roundType, tBf)
    delIndex = []
    mergeCount = 0
    av_agg_arr = np.zeros([1, numElements], dtype=float)

    # Cycle through the dataframe to accumulate aggregates and note which
    # rows to delete
    for i, row in data.iterrows():  # i is the integer index value, from 0
        backDate = timeStampReformat(row['Timestamp'], roundType, tBf)
        data.loc[i, 'Timestamp'] = backDate  # could be done better: not all rows need updating

        if backDate > anchorValue:  # a new time block starts here
            delIndex.pop()  # keep the last row of the previous block...
            delIndex.append(i)  # ...and flag the current row instead
            if mergeCount != 0:
                av_agg_arr = av_agg_arr / mergeCount
                for idx in range(1, numElements - 1):
                    if isinstance(row.values[idx], float):
                        # Write the averages into the previous row (i - 1),
                        # the last of the prior datetime group
                        data.iloc[i - 1, idx] = av_agg_arr[0, idx]

            anchorValue = backDate
            mergeCount = 0

            # Re-initialise aggregates and pass in the current row's values
            av_agg_arr[:] = 0
            for idx in range(1, numElements - 1):
                if isinstance(row.values[idx], float) and not pd.isnull(row.values[idx]):
                    av_agg_arr[0, idx] += row.values[idx]
        else:  # row is still part of the same datetime group
            for idx in range(1, numElements - 1):
                if isinstance(row.values[idx], float) and not pd.isnull(row.values[idx]):
                    av_agg_arr[0, idx] += row.values[idx]
            mergeCount += 1
            delIndex.append(i)  # flag this row for deletion

    data.drop(data.index[delIndex], inplace=True)  # delete all flagged rows
    data.reset_index(drop=True, inplace=True)  # reset_index returns a copy, so inplace is needed

    pd.options.mode.chained_assignment = 'warn'  # restore default
    return data
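As a side note, pandas can do this kind of time-block grouping natively when the timestamps form the index: `resample` bins rows into fixed intervals and aggregates each bin, which may replace the loop above entirely (assuming averaging is the desired aggregation):

```python
import pandas as pd

# Sparse readings with irregular timestamps
idx = pd.to_datetime(["2015-10-27 21:15:00",
                      "2015-10-27 21:18:00",
                      "2015-10-27 21:25:00"])
df = pd.DataFrame({"Col1": [1220.0, 1420.0, 1410.0]}, index=idx)

# Group into 10-minute blocks and average each block; NaNs are skipped
blocked = df.resample("10min").mean()
# blocked is indexed by block start: 21:10 -> 1320.0, 21:20 -> 1410.0
```

The time-block type and factor map onto the resample rule string (e.g. `"30s"`, `"2W"`), so the manual rounding helper is not needed in this approach.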

